This repo contains the implementation of our Interspeech 2021 papers: AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [1] and Cascaded Multilingual Audio-Visual Learning from Videos [2]. An audio-video retrieval demo is available on our website, avlnet.csail.mit.edu.

AVLnet (Audio-Video Language Network) is trained on audio-video pairs from the HowTo100M dataset and can be used for video clip retrieval using raw speech audio and natural sounds, without needing to transcribe speech to text. AVLnet-Text integrates a text branch and is trained on audio, video, and text from the HowTo100M dataset; it can be used for text-to-video retrieval on standard video-and-language datasets. We propose two versions of the model: AVLnet-Text-Tri, which keeps the three branches separate so that any two modalities can be compared, and AVLnet-Text-Fused, which fuses the audio and text branches to exploit the complementary information in audio and text.

To learn multilingual representations, we propose a cascaded approach that applies the AVLnet model trained on English videos to videos in Japanese. We collected a dataset of instructional cooking videos in Japanese, named YouCook-Japanese. With our cascaded approach, we show a nearly 10x improvement in retrieval performance on YouCook-Japanese compared to training on the Japanese videos alone.

Instructions

In this repo, we provide everything necessary to evaluate and fine-tune our models already trained on HowTo100M (pretrained weights provided). The instructions for training on HowTo100M are in training.md.

Currently, we provide:

Requirements

We recommend installing the following packages in a fresh anaconda environment. Note that the evaluation code will run without Librosa and Apex. Our training code will also run without Apex, but we have only tested it using Apex with mixed-precision.
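The exact package list is not reproduced here, but as a minimal sketch (the environment name, Python version, and package set below are assumptions, not pinned requirements), a setup along these lines should work:

conda create -n avlnet python=3.6       # environment name and Python version are placeholders
conda activate avlnet
conda install pytorch -c pytorch        # the training and evaluation scripts are PyTorch-based
pip install librosa                     # optional: evaluation runs without Librosa
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --no-cache-dir ./ && cd ..   # optional: Apex, only needed for mixed-precision training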

Download the Model Weights and Data

wget https://www.dropbox.com/sh/bd75sz4m734xs0z/AADGydRa_0QClNmGXEtBOoKca/AVLnet_release_models.tar.gz?dl=0
tar -xvf 'AVLnet_release_models.tar.gz?dl=0'
mkdir model
mv AVLnet_release model
wget https://www.dropbox.com/sh/bd75sz4m734xs0z/AADUY_-IqGWx9NiiXb6ae304a/AVLnet_release_data.tar.gz?dl=0
tar -xvf 'AVLnet_release_data.tar.gz?dl=0'
wget https://www.dropbox.com/s/4tqokt8pp53gjjp/YouCook_Japanese.tar.gz?dl=0
tar -xvf 'YouCook_Japanese.tar.gz?dl=0' && mv YouCook_Japanese data
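After extraction, a quick sanity check (assuming the archives unpack into the model/ and data/ locations referenced by the commands below):

ls model/AVLnet_release   # expect AVLnet_release.pth, AVLnet_Text_Tri_release.pth, AVLnet_Text_Fused_release.pth
ls data                   # expect the *.pkl feature files and the YouCook_Japanese folder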

General Code Notes

AVLnet Training and Evaluation on YouCook2, MSR-VTT, CrossTask, and YouCook-Japanese

YouCook2

Evaluate and fine-tune the HowTo100M-trained model on YouCook2:

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=256 --epochs=5 --lr_decay=1.0 --embd_dim=4096  --pretrain_path=model/AVLnet_release/AVLnet_release.pth

Train from scratch on YouCook2:

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=64 --epochs=15 --lr=1e-4 --lr_decay=1.0 --embd_dim=4096

MSR-VTT

Evaluate and fine-tune the HowTo100M-trained model on MSR-VTT:

python train.py --msrvtt=1 --eval_msrvtt=1 --num_thread_reader=8 --batch_size=256 --epochs=5 --lr_decay=1.0 --embd_dim=4096  --pretrain_path=model/AVLnet_release/AVLnet_release.pth

Train from scratch on MSR-VTT:

python train.py --msrvtt=1 --eval_msrvtt=1 --num_thread_reader=8 --batch_size=64 --epochs=15 --lr_decay=1.0

CrossTask

Evaluate and fine-tune the HowTo100M-trained model on CrossTask:

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=256 --epochs=5 --lr_decay=1.0 --embd_dim=4096  --pretrain_path=model/AVLnet_release/AVLnet_release.pth --youcook_train_path=data/crosstask_clips_train.pkl --youcook_val_path=data/crosstask_clips_val.pkl

Train from scratch on CrossTask:

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=64 --epochs=15 --lr=1e-4 --lr_decay=1.0 --embd_dim=4096 --youcook_train_path=data/crosstask_clips_train.pkl --youcook_val_path=data/crosstask_clips_val.pkl

YouCook-Japanese

Evaluate and fine-tune the HowTo100M-trained model on YouCook-Japanese (note: the validation set is provided as youcook_japanese_val.pkl and should be used for hyperparameter tuning; see the example after the command below):

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=256 --epochs=5 --lr_decay=1.0 --embd_dim=4096 --pretrain_path=model/AVLnet_release/AVLnet_release.pth --youcook_train_path=data/YouCook_Japanese/youcook_japanese_train.pkl --youcook_val_path=data/YouCook_Japanese/youcook_japanese_eval.pkl  
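For hyperparameter tuning, the same command can be pointed at the provided validation split instead of the evaluation split (the path below assumes youcook_japanese_val.pkl sits alongside the other YouCook-Japanese pickles):

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=256 --epochs=5 --lr_decay=1.0 --embd_dim=4096 --pretrain_path=model/AVLnet_release/AVLnet_release.pth --youcook_train_path=data/YouCook_Japanese/youcook_japanese_train.pkl --youcook_val_path=data/YouCook_Japanese/youcook_japanese_val.pkl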

Train from scratch on YouCook-Japanese:

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=64 --epochs=15 --lr_decay=1.0 --embd_dim=4096 --lr=1e-4 --youcook_train_path=data/YouCook_Japanese/youcook_japanese_train.pkl --youcook_val_path=data/YouCook_Japanese/youcook_japanese_eval.pkl  

AVLnet-Text Evaluation and Fine-tuning

Please see our paper for the difference between AVLnet-Text-Tri and AVLnet-Text-Fused. AVLnet-Text-Tri performs T->A+V retrieval (text queries against an additively fused audio-video representation), while AVLnet-Text-Fused performs T+A->V retrieval (a fused text-audio query against video).

AVLnet-Text-Tri

Note the --fuse_videoaudio_additive=1 flag (check args.py for details).

Evaluate and fine-tune the HowTo100M-trained model on YouCook2:

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=256 --epochs=3 --tri_modal=1 --fuse_videoaudio_additive=1 --lr=1e-4 --lr_decay=0.9 --embd_dim=6144 --pretrain_path=model/AVLnet_release/AVLnet_Text_Tri_release.pth 

Evaluate and fine-tune the HowTo100M-trained model on MSR-VTT:

python train.py --msrvtt=1 --eval_msrvtt=1 --num_thread_reader=8 --batch_size=256 --epochs=15 --tri_modal=1 --fuse_videoaudio_additive=1 --lr=1e-4 --lr_decay=1.0 --embd_dim=6144 --pretrain_path=model/AVLnet_release/AVLnet_Text_Tri_release.pth 

AVLnet-Text-Fused

Evaluate and fine-tune the HowTo100M-trained model on YouCook2:

python train.py --youcook=1 --eval_youcook=1 --num_thread_reader=8 --batch_size=256 --epochs=5 --lr_decay=1.0 --embd_dim=4096  --pretrain_path=model/AVLnet_release/AVLnet_Text_Fused_release.pth --lr=1e-5 --tri_modal_fuse=1 --tri_modal=1

Evaluate and fine-tune the HowTo100M-trained model on MSR-VTT:

python train.py --msrvtt=1 --eval_msrvtt=1 --num_thread_reader=8 --batch_size=256 --epochs=5 --lr_decay=1.0 --embd_dim=4096  --pretrain_path=model/AVLnet_release/AVLnet_Text_Fused_release.pth --lr=1e-5 --tri_modal_fuse=1 --tri_modal=1

Use the model on your own videos

References

[1] Andrew Rouditchenko*, Angie Boggust*, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. Interspeech 2021.

[2] Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass. Cascaded Multilingual Audio-Visual Learning from Videos. Interspeech 2021.

AVLnet - Bibtex:

@article{rouditchenko2020avlnet,
  title={Avlnet: Learning audio-visual language representations from instructional videos},
  author={Rouditchenko, Andrew and Boggust, Angie and Harwath, David and Chen, Brian and Joshi, Dhiraj and Thomas, Samuel and Audhkhasi, Kartik and Kuehne, Hilde and Panda, Rameswar and Feris, Rogerio and others},
  journal={arXiv preprint arXiv:2006.09199},
  year={2020}
}

Cascaded Multilingual - Bibtex:

@article{rouditchenko2021cascaded,
  title={Cascaded Multilingual Audio-Visual Learning from Videos},
  author={Rouditchenko, Andrew and Boggust, Angie and Harwath, David and Thomas, Samuel and Kuehne, Hilde and Chen, Brian and Panda, Rameswar and Feris, Rogerio and Kingsbury, Brian and Picheny, Michael and others},
  journal={Proc. Interspeech 2021},
  pages={3006--3010},
  year={2021}
}

Contact

If you find any problems or have any questions, please open an issue and I will try to respond as soon as possible.

Acknowledgments and Licenses

The main structure of our code is adapted from Antoine Miech's original HowTo100M training code (https://github.com/antoine77340/howto100m). All code derived from there is licensed under Apache License 2.0 (Antoine Miech).

The code in model_davenet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/ and https://github.com/wnhsu/ResDAVEnet-VQ and is licensed under BSD-3 (David Harwath and Wei-Ning Hsu).

All other code is licensed under BSD-3 (Andrew Rouditchenko).

All license clauses are in the LICENSE file.