[NeurIPS 2022] ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

This is the official repository of the paper ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning. If you find our work useful, please cite:

@article{pan2022st,
  title={ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition},
  author={Pan, Junting and Lin, Ziyi and Zhu, Xiatian and Shao, Jing and Li, Hongsheng},
  journal={arXiv preprint arXiv:2206.13559},
  year={2022}
}

Environment

We use conda to manage the Python environment. The exported environment specification is provided in environment.yml and can be restored with conda env create -f environment.yml.

Configuration

Common configuration values (e.g., dataset paths and pretrained backbone paths) are set in config.py. An example configuration containing all required fields with empty values is provided in config.py.example. Copy config.py.example to config.py and fill in the values before running the models.
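
For orientation, below is a minimal sketch of what a filled-in config.py might look like. The authoritative field names and structure are those in config.py.example; the paths are placeholders, and the ViT-L/14 field name is inferred from the naming used elsewhere in this README.

# config.py -- illustrative sketch only; follow config.py.example for the real fields.
CLIP_VIT_B16_PATH = "/path/to/ViT-B-16.pt"   # CLIP ViT-B/16 checkpoint (see "Backbone preparation")
CLIP_VIT_L14_PATH = "/path/to/ViT-L-14.pt"   # CLIP ViT-L/14 checkpoint
DATASETS = {
    "k400": {},   # filled in as described in "Dataset preparation"
    "ssv2": {},
}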

Dataset preparation

The data list should be organized as follows:

<video_1> <label_1>
<video_2> <label_2>
...
<video_N> <label_N>

where <video_i> is the path to a video file and <label_i> is an integer between $0$ and $M-1$ denoting the class of the $i$-th video, with $M$ the total number of classes.

We release the data lists we used for Kinetics-400 (k400, train list link, val list link) and Something-something-v2 (ssv2, train list link, val list link), which reflect the class mapping of the released models and the videos available on our side. We strongly recommend cleaning the Kinetics-400 lists first, as some videos may have been taken down by YouTube for various reasons (the current implementation stops training when it hits a broken video); a minimal cleaning sketch is given below.
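
A minimal cleaning sketch (not part of the official code), assuming OpenCV is available and that list entries are paths relative to the dataset root; the file names and root path are placeholders.

# clean_list.py -- drops entries whose video is missing or cannot be decoded.
import os
import cv2  # pip install opencv-python

root = "/path/to/kinetics400/videos"            # dataset root dir (as set in config.py)
src, dst = "k400_train.txt", "k400_train_clean.txt"

kept, dropped = 0, 0
with open(src) as fin, open(dst, "w") as fout:
    for line in fin:
        if not line.strip():
            continue
        path, label = line.rsplit(maxsplit=1)   # "<video_i> <label_i>"
        full = path if os.path.isabs(path) else os.path.join(root, path)
        cap = cv2.VideoCapture(full)
        ok = cap.isOpened() and cap.read()[0]   # can open and decode one frame
        cap.release()
        if ok:
            fout.write(line)
            kept += 1
        else:
            dropped += 1
print(f"kept {kept} entries, dropped {dropped} broken or missing videos")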

After obtaining the videos and the data lists, set the root dir and the list paths in the DATASETS dictionary in config.py (fill in the blanks for k400 and ssv2, or add new items for custom datasets). For each dataset, five fields are required; see config.py.example for the complete list.

Backbone preparation

We use the CLIP checkpoints from the official release. Put the paths to the downloaded checkpoints in config.py. The currently supported architectures are CLIP ViT-B/16 (set CLIP_VIT_B16_PATH) and CLIP ViT-L/14 (set CLIP_VIT_L14_PATH).
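
As a quick sanity check (not part of this repo), the official CLIP releases are TorchScript archives and should open with torch.jit.load; some re-exported checkpoints are plain state dicts, hence the fallback. The path below is a placeholder.

# check_ckpt.py -- verify that a downloaded CLIP checkpoint can be opened.
import torch

path = "/path/to/ViT-B-16.pt"   # same value as CLIP_VIT_B16_PATH in config.py
try:
    model = torch.jit.load(path, map_location="cpu")    # official releases are TorchScript archives
    print("loaded TorchScript archive")
except RuntimeError:
    state_dict = torch.load(path, map_location="cpu")   # fallback for plain state dicts
    print(f"loaded state dict with {len(state_dict)} tensors")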

Run the models

We provide preset scripts with recommended settings in the scripts/ directory. For a detailed description of the command line arguments, see the help message of main.py.

Model zoo

The adapter architecture is denoted as (number of adapter layers x bottleneck width); a simplified sketch of a single adapter block is given below. This repo is a reproduction of the original code, so the accuracy of the released checkpoints may differ slightly from the numbers reported in the paper. More models will be released soon.
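
For illustration, here is a self-contained sketch of one ST-Adapter block (down-projection, depthwise 3D convolution over space-time, up-projection, residual connection). The 3x3x3 kernel and GELU activation are assumptions; exact details follow the official implementation and may differ. Under the naming above, 24x384 means 24 such blocks, each with a 384-channel bottleneck.

# A simplified sketch of one ST-Adapter block; for illustration only.
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=384, kernel=(3, 3, 3)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)            # e.g. 768 -> 384
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel,
                              padding=tuple(k // 2 for k in kernel),
                              groups=bottleneck)          # depthwise conv over (T, H, W)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)              # 384 -> 768

    def forward(self, x, T, H, W):
        # x: (B, T*H*W, dim) patch tokens (class token omitted for simplicity)
        B, N, D = x.shape
        y = self.down(x)                                   # (B, N, bottleneck)
        y = y.view(B, T, H, W, -1).permute(0, 4, 1, 2, 3)  # (B, C, T, H, W)
        y = self.conv(y)                                   # spatio-temporal mixing
        y = y.permute(0, 2, 3, 4, 1).reshape(B, N, -1)
        y = self.up(self.act(y))
        return x + y                                       # residual connection

# "24x384" = 24 such adapters with a 384-channel bottleneck; with dim=768 each block
# has roughly 768*384 + 384*768 + 384*27 ≈ 0.6M parameters (biases omitted),
# i.e. about 14M parameters for 24 blocks.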

Something-something-v2

Backbone arch. | Adapter arch. | Top-1 Acc. (%) | Links
CLIP ViT-B/16  | 24x384        | 66.9           | script / log / checkpoint

Kinetics-400

Backbone arch. | Adapter arch. | Top-1 Acc. (%) | Links
CLIP ViT-B/16  | 12x384        | 82.2           | script / log / checkpoint

Acknowledgements

The CLIP model implementation is modified from the official CLIP repo. Some data processing code comes from PySlowFast, and part of the training code comes from MAE. Thanks for their awesome work!