Preserving Modality Structure Improves Multi-Modal Learning
Swetha, Sirnam, Rizve, Mamshad Nayeem, Shvetsova, Nina, Kuehne, Hilde and Shah, Mubarak
Accepted at ICCV 2023!
This repo is the official implementation of Preserving Modality Structure Improves Multi-Modal Learning.
Repository contains:
- Training Code
- Model Weights [Todo]
- Fine-tuning and evaluation datasets: MSR-VTT and YouCook2
Get started
- Create an environment:
conda create python=3.8 -y -n multisk
conda activate multisk
conda install -y pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
pip install gensim==3.8.0 sacred==0.8.2 humanize==3.14.0 transformers==4.10.2 librosa==0.8.1 timm==0.4.12
- If needed, download data.tar with features and spectrograms to fine-tune and evaluate on YouCook2 and MSR-VTT here. Extract the archive:
tar -xvf data.tar
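To confirm that the environment created above matches the pinned versions, a small check like the following can be run inside the `multisk` environment. This is only a sketch: the helper name `check_versions` is hypothetical and not part of this repo, and it assumes the conda `pytorch` package is registered under the distribution name `torch` and that conda-installed packages expose standard package metadata.

```python
from importlib import metadata

# Versions pinned in the install commands above.
# Assumption: the conda "pytorch" package is registered as "torch".
EXPECTED = {
    "torch": "1.7.0",
    "torchvision": "0.8.0",
    "torchaudio": "0.7.0",
    "gensim": "3.8.0",
    "sacred": "0.8.2",
    "humanize": "3.14.0",
    "transformers": "4.10.2",
    "librosa": "0.8.1",
    "timm": "0.4.12",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Some builds append a local suffix such as "+cu102";
        # compare only the part before "+".
        ok = have is not None and have.split("+")[0] == want
        report[name] = (want, have, ok)
    return report


if __name__ == "__main__":
    for name, (want, have, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:14s} expected {want:8s} installed {have or 'missing':10s} {status}")
```

A mismatch does not necessarily mean the code will fail, but the versions listed above are the ones the install commands pin.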
Pretraining
- Downloading HowTo100M and feature extraction. Please note that HowTo100M videos require a huge amount of storage, and the features alone take up terabytes of space. Feature extraction (ResNet-152, ResNeXt-101) and audio spectrogram extraction are described in detail in https://github.com/roudimit/AVLnet/blob/main/training.md.
- Review configs/pretraining/resnet_tva.yaml and make sure csv, features_path, features_path_audio, and caption_path point to the correct paths. The CSV file should contain one column named 'path' with a list of videos. An example of the CSV file that we used in training can be found here (HowTo100M_1166_videopaths.txt).
- Train:
python train.py --config configs/pretraining/resnet_tva.yaml
Using the model on your own data
If you want to use the model on your own data, please follow the steps described in https://github.com/roudimit/AVLnet for feature extraction and audio spectrogram extraction.
Cite
If you use this code in your research, please cite:
@InProceedings{Swetha_2023_ICCV,
author = {Swetha, Sirnam and Rizve, Mamshad Nayeem and Shvetsova, Nina and Kuehne, Hilde and Shah, Mubarak},
title = {Preserving Modality Structure Improves Multi-Modal Learning},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {21993-22003}
}
Contact
If you have any problems with the code or have a question, please open an issue or email swetha(dot)sirnam at ucf.edu. I'll try to answer as soon as possible.
Acknowledgments and Licenses
The main structure of the code is based on everything-at-once, which is built upon frozen-in-time.
The code in davenet.py, layers.py, and avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).