# CAST: Cross-Attention in Space and Time for Video Action Recognition [NeurIPS 2023]

[Project Page] [Arxiv]
## :wrench: Installation
We conduct all the experiments with 16 NVIDIA GeForce RTX 3090 GPUs. First, install PyTorch 1.10.0+ and torchvision 0.11.0.
```bash
conda create -n vmae_1.10 python=3.8 ipykernel -y
conda activate vmae_1.10
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 -c pytorch
```
Then, install timm, triton, DeepSpeed, and others.
```bash
pip install triton==1.0.0
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git checkout 3a3dfe66bb
DS_BUILD_OPS=1 pip install . --global-option="build_ext"
pip install TensorboardX decord einops scipy pandas requests
ds_report
```
If DeepSpeed is installed correctly, running the `ds_report` command shows the status of each DeepSpeed op. For other DeepSpeed-related issues, please refer to the DeepSpeed GitHub page.
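As an additional sanity check (not part of the original repo), a minimal Python snippet can confirm that the key packages import with the expected versions:

```python
# Minimal environment check; expected versions follow the install commands above.
import torch
import torchvision
import deepspeed
import decord
import einops

print("torch:", torch.__version__)              # expect 1.10.0
print("torchvision:", torchvision.__version__)  # expect 0.11.0
print("deepspeed:", deepspeed.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```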
## :file_folder: Data Preparation
- We report experimental results on three standard datasets: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400.
- We provide sample annotation files in the `annotations` folder.
### EPIC-KITCHENS-100

The pre-processing of EPIC-KITCHENS-100 can be summarized into 3 steps:

- Download the dataset from the official website.
- Preprocess the dataset by resizing the short edge of the videos to 256px. You can refer to the MMAction2 Data Benchmark.
- Generate the annotations needed by the dataloader ("<video_id>,<verb_class>,<noun_class>" in annotations); see the sketch below. The annotations usually include `train.csv` and `val.csv`. The format of the `*.csv` files is:

  ```
  video_1,verb_1,noun_1
  video_2,verb_2,noun_2
  video_3,verb_3,noun_3
  ...
  video_N,verb_N,noun_N
  ```

All video files are located inside the `DATA_PATH`.
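A minimal sketch (not part of the official code) of how such an annotation file could be generated from the official EPIC-KITCHENS-100 CSVs; the input file name and column names are assumptions and may need to be adapted:

```python
# Hypothetical helper: build "<video_id>,<verb_class>,<noun_class>" rows for the dataloader.
# "EPIC_100_train.csv" and the column names are assumptions based on the official annotations.
import pandas as pd

src = pd.read_csv("EPIC_100_train.csv")
out = src[["video_id", "verb_class", "noun_class"]]      # keep only the fields described above
out.to_csv("annotations/train.csv", header=False, index=False)
```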
### Something-Something-V2

The pre-processing of Something-Something-V2 can be summarized into 3 steps:

- Download the dataset from the official website.
- Preprocess the dataset by changing the video extension from `.webm` to `.mp4`, keeping the original height of 240px. You can refer to the MMAction2 Data Benchmark (see the conversion sketch below).
- Generate the annotations needed by the dataloader ("<video_id> <video_class>" in annotations). The annotations usually include `train.csv`, `val.csv`, and `test.csv`. The format of the `*.csv` files is:

  ```
  video_1.mp4 label_1
  video_2.mp4 label_2
  video_3.mp4 label_3
  ...
  video_N.mp4 label_N
  ```

All video files are located inside the `DATA_PATH`.
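A minimal conversion sketch (an assumption, not the official preprocessing script) using ffmpeg through Python; the paths are placeholders:

```python
# Convert SSV2 .webm clips to .mp4 at 240px height with ffmpeg (assumed to be on PATH).
import subprocess
from pathlib import Path

src_dir, dst_dir = Path("ssv2_webm"), Path("DATA_PATH")  # hypothetical locations
dst_dir.mkdir(parents=True, exist_ok=True)

for webm in src_dir.glob("*.webm"):
    mp4 = dst_dir / (webm.stem + ".mp4")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(webm),
         "-vf", "scale=-2:240",                          # keep aspect ratio, height 240px
         "-c:v", "libx264", str(mp4)],
        check=True,
    )
```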
### Kinetics-400

The pre-processing of Kinetics-400 can be summarized into 3 steps:

- Download the dataset from the official website or OpenDataLab.
- Preprocess the dataset by resizing the short edge of the videos to 320px. You can refer to the MMAction2 Data Benchmark.
- Generate the annotations needed by the dataloader ("<video_id> <video_class>" in annotations); see the sketch below. The annotations usually include `train.csv`, `val.csv`, and `test.csv`. The format of the `*.csv` files is:

  ```
  video_1.mp4 label_1
  video_2.mp4 label_2
  video_3.mp4 label_3
  ...
  video_N.mp4 label_N
  ```

All video files should be split into `DATA_PATH/train` and `DATA_PATH/val`.
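A minimal sketch (hypothetical helper, not from the repo) that writes the space-separated annotation files and checks that every listed clip exists under the expected split folder:

```python
# Hypothetical helper: write "<video_id> <video_class>" files and verify the train/val split.
from pathlib import Path

DATA_PATH = Path("DATA_PATH")                                   # placeholder root
splits = {
    "train": [("video_1.mp4", 0), ("video_2.mp4", 3)],          # hypothetical (file, label) pairs
    "val":   [("video_3.mp4", 7)],
}

for split, items in splits.items():
    with open(f"annotations/{split}.csv", "w") as f:
        for name, label in items:
            assert (DATA_PATH / split / name).exists(), f"missing clip: {name}"
            f.write(f"{name} {label}\n")
```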
## Expert model preparation
We use pre-trained weights for the spatial and temporal experts. The spatial expert (CLIP) uses the official pre-trained weights. The temporal expert (VideoMAE) uses pre-trained weights for the three datasets EK100, K400, and SSV2: for K400 and SSV2 we use the official weights, and for EK100 we use weights we pre-trained ourselves. Put each downloaded expert weight into the `VMAE_PATH` and `CLIP_PATH` of the fine-tuning script.
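Before launching fine-tuning, a quick load test of the downloaded weights can catch broken downloads early. The sketch below is an assumption (file names are placeholders; the official CLIP release is a TorchScript archive, while VideoMAE checkpoints are plain state dicts):

```python
# Hypothetical sanity check of the two expert checkpoints; file names are placeholders.
import torch

# Temporal expert (VideoMAE): a regular checkpoint, typically a dict with a "model" state dict.
vmae_ckpt = torch.load("VMAE_MODEL_PATH/checkpoint.pth", map_location="cpu")
state = vmae_ckpt.get("model", vmae_ckpt)
print("VideoMAE parameter tensors:", len(state))

# Spatial expert (CLIP): the official OpenAI release is saved as a TorchScript archive.
clip_model = torch.jit.load("CLIP_MODEL_PATH/ViT-B-16.pt", map_location="cpu")
print("CLIP parameter tensors:", len(clip_model.state_dict()))
```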
## Fine-tuning CAST
We provide off-the-shelf scripts in the `scripts` folder.
- For example, the following script fine-tunes CAST on Kinetics-400 with 16 GPUs (2 nodes x 8 GPUs).
```bash
DATA_PATH=YOUR_PATH
ANNOTATION_PATH=YOUR_PATH
VMAE_MODEL_PATH=YOUR_PATH
CLIP_MODEL_PATH=YOUR_PATH

OMP_NUM_THREADS=1 python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --master_port ${YOUR_NUMBER} --nnodes=2 \
    --node_rank=${YOUR_NUMBER} --master_addr=${YOUR_NUMBER} \
    YOUR_PATH/run_bidirection_compo.py \
    --data_set Kinetics-400 \
    --nb_classes 400 \
    --vmae_model compo_bidir_vit_base_patch16_224 \
    --anno_path ${ANNOTATION_PATH} \
    --data_path ${DATA_PATH} \
    --clip_finetune ${CLIP_MODEL_PATH} \
    --vmae_finetune ${VMAE_MODEL_PATH} \
    --log_dir ${YOUR_PATH} \
    --output_dir ${YOUR_PATH} \
    --batch_size 6 \
    --input_size 224 \
    --short_side_size 224 \
    --save_ckpt_freq 25 \
    --num_sample 1 \
    --num_frames 16 \
    --opt adamw \
    --lr 1e-3 \
    --opt_betas 0.9 0.999 \
    --weight_decay 0.05 \
    --epochs 70 \
    --dist_eval \
    --test_num_segment 5 \
    --test_num_crop 3 \
    --num_workers 8 \
    --drop_path 0.2 \
    --layer_decay 0.75 \
    --mixup_switch_prob 0 \
    --mixup_prob 0.5 \
    --reprob 0. \
    --init_scale 1. \
    --update_freq 6 \
    --seed 0 \
    --enable_deepspeed \
    --warmup_epochs 5
```
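For reference, the effective batch size under these settings is `batch_size x update_freq x total GPUs`; a quick check, assuming the 16-GPU setup above:

```python
# Effective batch size for the script above, assuming 16 GPUs in total.
batch_size, update_freq, num_gpus = 6, 6, 16
print(batch_size * update_freq * num_gpus)  # 576
```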
## Evaluation

Evaluation command for EK100:

```bash
python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval
```

Evaluation command for SSV2 and K400:

```bash
python ./run_bidirection.py --fine_tune {YOUR_FINETUNED_WEIGHT} --eval
```
## Model Zoo

### EPIC-KITCHENS-100

| Method | Spatial Expert | Temporal Expert | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | 50 | 16x2x3 | log/checkpoint | 49.3 |

### Something-Something V2

| Method | Spatial Expert | Temporal Expert | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on SSV2) | 50 | 16x2x3 | log/checkpoint | 71.6 |

### Kinetics-400

| Method | Spatial Expert | Temporal Expert | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
|---|---|---|---|---|---|---|
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | 70 | 16x5x3 | log/checkpoint | 85.3 |
## Acknowledgements
This project is built upon VideoMAE, MAE, CLIP and BEiT. Thanks to the contributors of these great codebases.
## License
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
## Citation

```bibtex
@inproceedings{cast,
  title={CAST: Cross-Attention in Space and Time for Video Action Recognition},
  author={Lee, Dongho and Lee, Jongseo and Choi, Jinwoo},
  booktitle={NeurIPS},
  year={2023}
}
```