AIM: Adapting Image Models for Efficient Video Action Recognition

This repo is the official implementation of "AIM: Adapting Image Models for Efficient Video Action Recognition", published at ICLR 2023.

If you find our work useful in your research, please cite:

@inproceedings{yang2023aim,
    title={{AIM}: Adapting Image Models for Efficient Video Action Recognition},
    author={Taojiannan Yang and Yi Zhu and Yusheng Xie and Aston Zhang and Chen Chen and Mu Li},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=CIoSZ_HKHS7}
}

Introduction

In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. The overall structure of the proposed method is shown in the figure below.

<p><img src="figures/overallstructure.png" width="800" /></p>

During training, only the Adapters are updated, which greatly reduces the training cost while still achieving performance competitive with SoTA fully finetuned video models. As shown in the figure below, AIM outperforms previous SoTA methods while using fewer tunable parameters and fewer inference GFLOPs.

<p><img src="figures/overallperformance.png" width="500" /></p>

Installation

The code is based on VideoSwin, which in turn is built on MMAction2. To prepare the environment, please follow the instructions below.

# create virtual environment
conda create -n AIM python=3.7.13
conda activate AIM

# install pytorch
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge

# install CLIP
pip install git+https://github.com/openai/CLIP.git

# install other requirements
pip install -r requirements.txt

# install mmaction2
python setup.py develop
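
As a quick sanity check of the environment (assuming the installed package names are torch, clip, mmcv, and mmaction, as in MMAction2-based repos), the following imports should succeed:

# quick environment check; the package names below are assumptions based on the installs above
import torch
import clip
import mmcv
import mmaction

print(torch.__version__, torch.cuda.is_available())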

Install Apex:

We use apex for mixed precision training by default. To install apex, please follow the instructions in the official apex repo.

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
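
If you disable apex this way, the config still needs an optimizer_config entry; a minimal replacement (our assumption, falling back to the stock MMCV optimizer hook without fp16, not an AIM-specific setting) would be:

# plain MMCV optimizer hook, no apex / fp16
optimizer_config = dict(grad_clip=None)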

Data Preparation

The data pipeline is based on MMAction2. You can refer to the MMAction2 documentation for a general guideline on how to prepare the data. All the datasets used in this work (K400, K700, SSv2, and Diving-48) are supported in MMAction2.
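
For example, MMAction2's video datasets typically read a plain-text annotation file with one sample per line: a (relative) video path followed by an integer class label. The filenames below are made up for illustration:

abseiling/video_001.mp4 0
air_drumming/video_042.mp4 1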

Training

The training configs of the different experiments are provided in configs/recognition/vit/. To run an experiment, please use the following command, where <PATH/TO/CONFIG> is the training config you want to use. The default training setting is 8 GPUs with a batch size of 64.

bash tools/dist_train.sh <PATH/TO/CONFIG> <NUM_GPU> --test-last --validate --cfg-options model.backbone.pretrained=openaiclip work_dir=<PATH/TO/OUTPUT>

We also provide a training script in run_exp.sh. You can simply change the training config to train different models.
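
Recall that only the Adapters (and the classification head) are trained while the pre-trained backbone stays frozen. A hedged PyTorch sketch of that freezing logic is shown below; the keyword matching on parameter names is an assumption for illustration, since in this repo the trainable parameters are determined by the configs.

import torch.nn as nn

def freeze_non_adapter_params(model: nn.Module, trainable_keywords=("Adapter", "head")):
    # Freeze every parameter except those whose names contain a trainable keyword.
    # The keywords here are illustrative; the repo's configs define the real set
    # of tunable parameters (the Adapters plus the classification head).
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Tunable parameters: {trainable / 1e6:.1f}M")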

Key Files

Evaluation

Evaluation is performed automatically after training. If you would like to evaluate a model only, please use the following command:

bash tools/dist_test.sh <PATH/TO/CONFIG> <CHECKPOINT_FILE> <NUM_GPU> --eval top_k_accuracy

Models

Kinetics 400

| Backbone | Pretrain | GFLOPs | Param (M) | Tunable Param (M) | acc@1 | acc@5 | Views  | Checkpoint |
| -------- | -------- | ------ | --------- | ----------------- | ----- | ----- | ------ | ---------- |
| ViT-B/16 | CLIP     | 606    | 97        | 11                | 83.9  | 96.3  | 8x3x1  | checkpoint |
| ViT-B/16 | CLIP     | 1214   | 97        | 11                | 84.5  | 96.6  | 16x3x1 | checkpoint |
| ViT-B/16 | CLIP     | 2428   | 97        | 11                | 84.7  | 96.7  | 32x3x1 | checkpoint |
| ViT-L/14 | CLIP     | 2902   | 341       | 38                | 86.8  | 97.2  | 8x3x1  | checkpoint |
| ViT-L/14 | CLIP     | 5604   | 341       | 38                | 87.3  | 97.6  | 16x3x1 | checkpoint |
| ViT-L/14 | CLIP     | 11208  | 341       | 38                | 87.5  | 97.7  | 32x3x1 | checkpoint |

Kinetics 700

| Backbone | Pretrain | GFLOPs | Param (M) | Tunable Param (M) | acc@1 | Views  | Checkpoint |
| -------- | -------- | ------ | --------- | ----------------- | ----- | ------ | ---------- |
| ViT-B/16 | CLIP     | 7284   | 97        | 11                | 76.9  | 32x3x3 | checkpoint |
| ViT-L/14 | CLIP     | 33624  | 341       | 38                | 80.4  | 32x3x3 |            |

Diving-48

| Backbone | Pretrain | GFLOPs | Param (M) | Tunable Param (M) | acc@1 | Views  | Checkpoint |
| -------- | -------- | ------ | --------- | ----------------- | ----- | ------ | ---------- |
| ViT-B/16 | CLIP     | 809    | 97        | 11                | 88.9  | 32x1x1 | checkpoint |
| ViT-L/14 | CLIP     | 3736   | 341       | 38                | 90.6  | 32x1x1 |            |

TODO

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.