Official PyTorch Implementation of MME (CVPR2023)
Masked Motion Encoding for Self-Supervised Video Representation Learning
Xinyu Sun, Peihao Chen, Liangwei Chen, Changhao Li, Thomas H. Li, Mingkui Tan, Chuang Gan
<span id="main-results"></span>
Main Results
UCF101 & HMDB51
Method | Pre-train Data | Fine-tune Data | Backbone | Acc@1 | Download Link |
---|---|---|---|---|---|
MME | K400 | UCF101 | ViT-B | 96.5 | log/cow/google |
MME | K400 | HMDB51 | ViT-B | 78.0 | log/cow/google |
Kinetics-400 (K400)
Method | Pre-train Data | Backbone | Epoch | #Frames x Clips x Crops | Acc@1 | Download Link |
---|---|---|---|---|---|---|
MME | K400 | ViT-B | 1600 | 16x7x3 | 81.8 | log/cow/google |
Something-Something V2 (SSV2)
Method | Pre-train Data | Backbone | Epoch | #Frames x Clips x Crops | Acc@1 | Download Link |
---|---|---|---|---|---|---|
MME | SSV2 | ViT-B | 400 | 16x2x3 | 69.2 | log/cow/google |
MME | K400 | ViT-B | 400 | 16x2x3 | 69.5 | log/cow/google |
MME | K400 | ViT-B | 800 | 16x2x3 | 70.5 | log/cow/google |
MME | K400 | ViT-B | 1600 | 16x2x3 | 71.5 | log/cow/google |
Model Zoo
Pre-trained Weight
Method | Pre-train Data | Backbone | Epoch | Download Link |
---|---|---|---|---|
MME | K400 | ViT-B | 1600 | google/cow |
MME | K400 | ViT-B | 800 | google/cow |
MME | SSV2 | ViT-B | 400 | [TODO] |
Prepare Environment
Run install.sh to create environment and install packages.
cd ${MME_FOLDER_BASE} # cd the code base of MME
export CUDA_HOME=${PATH_TO_CUDA}
export PYTHONPATH=$PYTHONPATH:`pwd`
source scripts/tools/install.sh
Troubleshooting: if installation fails, replace the CUDA version of the torch/torchvision builds in install.sh with the CUDA version installed on your machine.
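To see whether the two CUDA versions mentioned in the troubleshooting note actually differ, a small check like the following may help. It is only a sketch: the helper names `parse_nvcc_release` and `report_cuda_versions` are hypothetical and not part of this repo, and it assumes `nvcc` is on your `PATH`.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output: str):
    """Return the 'major.minor' release string from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def report_cuda_versions():
    """Print the CUDA toolkit version seen by nvcc and the one torch was built with."""
    toolkit = None
    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        out = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
        toolkit = parse_nvcc_release(out)
    try:
        import torch  # only importable once the environment has been created
        built_with = torch.version.cuda
    except ImportError:
        built_with = None
    print(f"nvcc toolkit: {toolkit}, torch built with CUDA: {built_with}")
    return toolkit, built_with
```

Calling `report_cuda_versions()` inside the created environment prints both versions; if they disagree, edit the torch/torchvision build in install.sh as described above.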
Prepare Datasets
In this step, we put the dataset folders into this workspace. Our experiments use four datasets: Kinetics-400, Something-Something V2, UCF101, and HMDB51.
mkdir data
ln -s ${PATH_TO_DATASET} data/
After this process, the data folder should be organized as follows:
data/
├── csv
│   └── k400
│       ├── train.csv
│       ├── val.csv
│       └── test.csv
├── kinetics400
│   ├── train_video
│   │   └── abseiling
│   │       └── ztuc7tVNUDo_000003_000013.mp4
│   └── val_video
├── smth-smth-v2
│   ├── 20bn-something-something-v2
│   │   └── 8192.mp4
│   └── annotations
├── UCF101
│   ├── UCF-101
│   │   └── ApplyEyeMakeup
│   │       └── v_ApplyEyeMakeup_g24_c05.avi
│   └── ucfTrainTestlist
└── hmdb51
    ├── videos
    │   └── brush_hair
    │       └── Silky_Straight_Hair_Original_brush_hair_h_nm_np1_ba_goo_0.avi
    └── metafile
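If you want to confirm that your symlinks produced the layout in the directory tree above, a quick check along these lines may be useful. This is a sketch, not part of the repo: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the folder list is copied from the tree.

```python
from pathlib import Path

# Sub-folders copied from the directory tree above.
EXPECTED_DIRS = [
    "csv/k400",
    "kinetics400/train_video",
    "kinetics400/val_video",
    "smth-smth-v2/20bn-something-something-v2",
    "smth-smth-v2/annotations",
    "UCF101/UCF-101",
    "UCF101/ucfTrainTestlist",
    "hmdb51/videos",
    "hmdb51/metafile",
]


def missing_dirs(data_root="data", expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under data_root.

    Path.is_dir() follows symlinks, so folders linked with `ln -s` count as present.
    """
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

Running `missing_dirs()` from the code base root returns an empty list when every folder is in place. If you use customized csv files with a different layout, pass your own list as `expected`.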
We use csv files to list the video paths for the train, val, and test splits. Because the csv files can be customized, you can organize the data folder however you prefer. Default csv files can be found here.
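For a customized folder layout, a data list can be generated with a short script. The sketch below assumes one class per sub-folder and writes rows of `video path, integer label`; this column layout is an assumption and is not taken from the repo, so compare it against the default csv files before using it. `build_video_list` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def build_video_list(video_root, out_csv, exts=(".mp4", ".avi")):
    """Write one row per video under video_root/<class>/ and return the row count.

    ASSUMPTION: rows are (video path, integer label), with labels assigned by
    sorted class-folder name. Adjust to match the default csv files.
    """
    video_root = Path(video_root)
    classes = sorted(p.name for p in video_root.iterdir() if p.is_dir())
    label_of = {name: idx for idx, name in enumerate(classes)}
    rows = []
    for name in classes:
        for video in sorted((video_root / name).iterdir()):
            if video.suffix.lower() in exts:
                rows.append((str(video), label_of[name]))
    out_csv = Path(out_csv)
    out_csv.parent.mkdir(parents=True, exist_ok=True)
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

For example, `build_video_list("data/UCF101/UCF-101", "data/csv/ucf101/train.csv")` would list every video in the UCF-101 folder; the output path here is illustrative only, and splitting into train/val/test is left out.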
Prepare Motion Trajectory
<span id="A-using-pre-extracted"></span>
A. Using Pre-extracted Motion Trajectories
We provide pre-extracted motion trajectories for MME pre-training on the Kinetics-400 dataset. Decompress the archive and put it into data/trajs/kinetics400. The trajs folder should be organized as follows:
data/trajs
└── kinetics400
    └── train_video
        └── abseiling
            └── ztuc7tVNUDo_000003_000013.mp4_4.gz
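Before pre-training, it may be worth checking that every training video has a matching trajectory file. The sketch below infers the naming rule from the single example above (video file name plus `_4.gz`, mirrored under data/trajs/kinetics400); that rule, the meaning of the `_4` suffix, and the helper names are assumptions rather than documented behaviour.

```python
from pathlib import Path


def traj_path_for(video_path, video_root="data/kinetics400",
                  traj_root="data/trajs/kinetics400", suffix="_4.gz"):
    """Map a video path to the trajectory path implied by the example above.

    ASSUMPTION: the trajectory file mirrors the video's relative path and
    appends `suffix` to the full file name.
    """
    rel = Path(video_path).relative_to(video_root)
    return Path(traj_root) / rel.parent / (rel.name + suffix)


def videos_without_trajs(video_root="data/kinetics400",
                         traj_root="data/trajs/kinetics400", split="train_video"):
    """Return the .mp4 files in the given split that have no trajectory file."""
    video_root = Path(video_root)
    videos = sorted((video_root / split).rglob("*.mp4"))
    return [v for v in videos if not traj_path_for(v, video_root, traj_root).exists()]
```

An empty result from `videos_without_trajs()` suggests the archive was decompressed into the right place.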
B. Extract Motion Trajectories From Scratch
See EXTRACT_FEATURE.md for details.
Run MME
1. Pretrain the Model Using MME
To pretrain the model on K400 on 2 nodes with 8 A100 (80G) GPUs each, we set NUM_PROCESS = 8, NUM_NODES = 2, and BATCH_SIZE = 64.
bash scripts/pretrain/k400-1600epo.sh 8 2 0 MASTER_IP 64
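With more than one node, the script has to be started once per node. The sketch below builds one command line per node, assuming the third positional argument (`0` in the command above) is the node rank; the README does not name that argument, so treat this as a guess. It also assumes the batch size is per GPU when computing the total. Both helper names are hypothetical.

```python
def pretrain_commands(num_process, num_nodes, master_ip, batch_size,
                      script="scripts/pretrain/k400-1600epo.sh"):
    """Return one launch command per node.

    ASSUMPTION: the third positional argument is the node rank (0 .. num_nodes - 1).
    """
    return [
        f"bash {script} {num_process} {num_nodes} {rank} {master_ip} {batch_size}"
        for rank in range(num_nodes)
    ]


def total_batch_size(num_process, num_nodes, batch_size):
    """Total batch size across all GPUs, ASSUMING batch_size is per GPU."""
    return num_process * num_nodes * batch_size


if __name__ == "__main__":
    for command in pretrain_commands(8, 2, "MASTER_IP", 64):
        print(command)
    print("total batch size:", total_batch_size(8, 2, 64))
```

Under these assumptions, the first printed command matches the one above, the second differs only in the rank, and the total batch size for this setting is 8 x 2 x 64 = 1024.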
2. Finetune the Model on Downstream Datasets
bash scripts/finetune/ssv2/k400pt-800epo.sh 8 2 0 MASTER_IP 28
We also provide finetuned models in Main Results; you are free to download them and run evaluation directly.
# first: download the checkpoint and put it into the OUTPUT_DIR
exps/m3video/
└── finetune
    └── ssv2
        └── k400pt-1600epo
            └── checkpoint-best
                └── mp_rank_00_model_states.pt
# second: run eval!
bash scripts/finetune/ssv2/k400pt-1600epo.sh 8 1 0 localhost 28 --eval
Cite MME
Please star the project and cite our paper if you find it helpful~
@inproceedings{sun2023mme,
title={Masked Motion Encoding for Self-Supervised Video Representation Learning},
author={Sun, Xinyu and Chen, Peihao and Chen, Liangwei and Li, Changhao and Li, Thomas H and Tan, Mingkui and Gan, Chuang},
booktitle={The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2023}
}
@article{sun2022m3video,
title={M$^3$Video: Masked Motion Modeling for Self-Supervised Video Representation Learning},
author={Sun, Xinyu and Chen, Peihao and Chen, Liangwei and Li, Thomas H and Tan, Mingkui and Gan, Chuang},
journal={arXiv preprint arXiv:2210.06096},
year={2022}
}
Acknowledgements
Our code is modified from VideoMAE. Thanks for their awesome work!