Concatenated Masked Autoencoders as Spatial-Temporal Learner: A PyTorch Implementation

<p align="center"> <img src="https://github.com/minhoooo1/CatMAE/blob/master/figures/arch.png" width="800"> </p>

This is a PyTorch re-implementation of the paper Concatenated Masked Autoencoders as Spatial-Temporal Learner.

Requirements

Data Preparation

We use two datasets in total: Kinetics-400 for pre-training and action recognition, and DAVIS-2017 for video segmentation.
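
The expected on-disk layout is given by the data-preparation instructions, not reproduced here. Purely as an illustration, a hypothetical indexing script like the following (the paths and CSV format are assumptions, not the repository's actual contract) can gather Kinetics-400 clips for pre-training:

```python
# Hypothetical indexing script; the repository's actual expected layout may differ.
import csv
from pathlib import Path

def build_index(root: str, out_csv: str) -> None:
    """Write one row per .mp4 clip found under `root` to a CSV file."""
    root_path = Path(root)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for video in sorted(root_path.rglob("*.mp4")):
            # Path only: labels are not needed for self-supervised pre-training.
            writer.writerow([str(video)])

if __name__ == "__main__":
    # Assumed directory names, for illustration only.
    build_index("data/kinetics400/train", "data/k400_train_list.csv")
```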

Pre-training

Arguments set in the config file take precedence over the command-line defaults.

To pre-train CatMAE-ViT-Small, run the following command:

```
python main_pretrain.py --config_file configs/pretrain_catmae_vit-s-16.json
```
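
A minimal sketch of this config-over-CLI precedence, assuming a flat JSON config whose keys mirror the argparse argument names (the argument names and defaults below are illustrative, not taken from main_pretrain.py):

```python
# Sketch only: JSON config values override command-line defaults.
import argparse
import json

def parse_args():
    parser = argparse.ArgumentParser("CatMAE pre-training (sketch)")
    parser.add_argument("--config_file", default=None, type=str)
    # Hypothetical training arguments, for illustration only.
    parser.add_argument("--batch_size", default=64, type=int)
    parser.add_argument("--epochs", default=400, type=int)
    parser.add_argument("--mask_ratio", default=0.9, type=float)
    args = parser.parse_args()

    if args.config_file is not None:
        with open(args.config_file) as f:
            config = json.load(f)
        # Values from the config file win over the CLI defaults.
        for key, value in config.items():
            setattr(args, key, value)
    return args

if __name__ == "__main__":
    print(parse_args())
```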

Some important arguments

Pre-trained checkpoints

The following table provides the pre-trained checkpoints used in the paper.

<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<tr>
<th valign="bottom"></th>
<th valign="bottom">ViT/16-Small</th>
<th valign="bottom">ViT/8-Small</th>
</tr>
<!-- TABLE BODY -->
<tr>
<td align="left">pre-trained checkpoint</td>
<td align="center"><a href="https://drive.google.com/file/d/1xWrpSxZy6d3r_XnsZmXvqM1XUReJ7v97/view?usp=drive_link">download</a></td>
<td align="center"><a href="https://drive.google.com/file/d/1ksYZJPa2pZ-NYWjYKLh05-bt_A40Rhm7/view?usp=drive_link">download</a></td>
</tr>
<tr>
<td align="left">DAVIS 2017 J&amp;F<sub>m</sub></td>
<td align="center">62.5</td>
<td align="center">70.4</td>
</tr>
</tbody></table>
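
A minimal sketch of loading a released checkpoint, assuming the weights are stored under a "model" key (an assumption; inspect the file and the model definitions in this repository to confirm the actual key names and constructor):

```python
import torch

# Assumed local filename for a downloaded checkpoint.
checkpoint = torch.load("catmae_vit-s-16.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # fall back to a raw state dict

# Inspect the first few parameter names before wiring up a model.
print(sorted(state_dict.keys())[:10])

# Hypothetical usage, with a constructor name assumed for illustration:
# model = models_vit.vit_small_patch16()
# msg = model.load_state_dict(state_dict, strict=False)
# print(msg.missing_keys, msg.unexpected_keys)
```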

Video segmentation on DAVIS-2017

Instructions for video segmentation are in DAVIS.md.

Action recognition on Kinetics-400

Instructions for action recognition are in KINETICS400.md.