Parameter Efficient Multimodal Transformers for Video Representation Learning

This repository contains the code and models for our ICLR 2021 paper:

Parameter Efficient Multimodal Transformers for Video Representation Learning <br> Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song <br> [paper] [poster] [slides]

@inproceedings{lee2021avbert,
    title="{Parameter Efficient Multimodal Transformers for Video Representation Learning}",
    author={Sangho Lee and Youngjae Yu and Gunhee Kim and Thomas Breuel and Jan Kautz and Yale Song},
    booktitle={ICLR},
    year=2021
}

System Requirements

Installation

  1. Install PyTorch 1.6.0, torchvision 0.7.0, and torchaudio 0.6.0 for your environment. Follow the instructions HERE.

  2. Install other required packages.

pip install -r requirements.txt
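As an optional sanity check after installation, the pinned versions from step 1 can be compared against what is actually installed. This is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes Python 3.8 or newer so that `importlib.metadata` is available.

```python
from importlib import metadata

# Versions named in installation step 1 above.
REQUIRED = {"torch": "1.6.0", "torchvision": "0.7.0", "torchaudio": "0.6.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Return {package: (wanted, found)} for every package whose version differs."""
    mismatches = {}
    for package, wanted in required.items():
        found = lookup(package)
        # Drop any local build suffix, so "1.6.0+cu101" still counts as 1.6.0.
        public = found.split("+")[0] if found else None
        if public != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

If the script prints nothing, the three packages match the versions listed above.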

Download Data

python download_ucf101.py
python download_esc50.py
python download_ks.py
python download_checkpoint.py

Experiments

To run experiments with a single GPU:

UCF101 (split: 1, 2 or 3)

cd code
python run_net.py \
    --cfg_file configs/ucf101/config.yaml \
    --configuration ucf101 \
    --pretrain_checkpoint_path checkpoints/checkpoint.pyth \
    TRAIN.DATASET_SPLIT <split> \
    TEST.DATASET_SPLIT <split>

ESC-50 (split: 1, 2, 3, 4 or 5)

cd code
python run_net.py \
    --cfg_file configs/esc50/config.yaml \
    --configuration esc50 \
    --pretrain_checkpoint_path checkpoints/checkpoint.pyth \
    TRAIN.DATASET_SPLIT <split> \
    TEST.DATASET_SPLIT <split>

Kinetics-Sounds

cd code
python run_net.py \
    --cfg_file configs/kinetics-sounds/config.yaml \
    --configuration kinetics-sounds \
    --pretrain_checkpoint_path checkpoints/checkpoint.pyth
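Since UCF101 and ESC-50 are run once per split, it may be convenient to generate the commands above programmatically. The following is a rough sketch under stated assumptions, not a script shipped with the repository: `SPLITS`, `build_command`, and `run_all` are hypothetical names, the helper is assumed to be launched from the repository root (hence `cwd="code"`), and only the flags shown in the commands above are used.

```python
import subprocess

# Split numbers as listed in the headings above; Kinetics-Sounds takes no split.
SPLITS = {
    "ucf101": [1, 2, 3],
    "esc50": [1, 2, 3, 4, 5],
    "kinetics-sounds": [None],
}


def build_command(dataset, split=None, checkpoint="checkpoints/checkpoint.pyth"):
    """Assemble the run_net.py argument list for one dataset and optional split."""
    command = [
        "python", "run_net.py",
        "--cfg_file", f"configs/{dataset}/config.yaml",
        "--configuration", dataset,
        "--pretrain_checkpoint_path", checkpoint,
    ]
    if split is not None:
        # Train and test on the same split, as in the commands above.
        command += ["TRAIN.DATASET_SPLIT", str(split),
                    "TEST.DATASET_SPLIT", str(split)]
    return command


def run_all(dataset, dry_run=True):
    """Print one command per split; with dry_run=False, also run it in code/."""
    for split in SPLITS[dataset]:
        command = build_command(dataset, split)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop as soon as one run fails.
            subprocess.run(command, cwd="code", check=True)


if __name__ == "__main__":
    run_all("ucf101")  # dry run: only prints the three commands
```

Calling `run_all("esc50", dry_run=False)` would launch the five ESC-50 runs one after another.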

After submission, we further adjusted hyperparameters and achieved the following results.

Dataset          Top-1 Accuracy  Top-5 Accuracy
UCF101           87.5            97.4
ESC-50           85.9            96.9
Kinetics-Sounds  85.8            97.8

Acknowledgments

This source code is based on PySlowFast.