
CAST: Cross-Attention in Space and Time for Video Action Recognition [NeurIPS 2023] [Project Page] [arXiv]

(Figure: overview of the CAST framework)

:wrench: Installation

We conduct all the experiments with 16 NVIDIA GeForce RTX 3090 GPUs. First, install PyTorch 1.10.0+ and torchvision 0.11.0.

conda create -n vmae_1.10  python=3.8 ipykernel -y
conda activate vmae_1.10
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 -c pytorch
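
A quick check that the environment is set up as expected:

import torch
import torchvision

# Expect 1.10.0 and 0.11.0, matching the pins above.
print(torch.__version__, torchvision.__version__)
# Fine-tuning assumes CUDA GPUs, so this should print True.
print(torch.cuda.is_available())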

Then, install timm, triton, DeepSpeed, and others.

pip install triton==1.0.0
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git checkout 3a3dfe66bb
DS_BUILD_OPS=1 pip install . --global-option="build_ext"
pip install timm TensorboardX decord einops scipy pandas requests
ds_report

If you have installed DeepSpeed successfully, running the `ds_report` command should produce output like the report shown below. For other DeepSpeed-related issues, please refer to the DeepSpeed GitHub page.

(Screenshot: expected `ds_report` output)

:file_folder: Data Preparation

EPIC-KITCHENS-100

Something-Something-V2

Kinetics-400

  1. All video files should be split into DATA_PATH/train and DATA_PATH/val (a quick layout check is sketched below).
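
As a quick sanity check of this layout (a minimal sketch in Python; the flat directory structure and the .mp4 extension are assumptions, adjust them to your data):

from pathlib import Path

# The same DATA_PATH you will pass to the fine-tuning script below.
DATA_PATH = Path("YOUR_PATH")
for split in ("train", "val"):
    videos = list((DATA_PATH / split).glob("*.mp4"))  # adjust the pattern to your files
    print(f"{split}: {len(videos)} videos")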

Expert model preparation

We use pre-trained weights for the spatial and temporal experts. For the spatial expert (CLIP), we use the official pre-trained weights. For the temporal expert (VideoMAE), we use weights pre-trained on three datasets: EK100, K400, and SSV2. The K400 and SSV2 weights are the official releases, while the EK100 weights are pre-trained by ourselves. Put the path of each downloaded expert weight into the VMAE_MODEL_PATH and CLIP_MODEL_PATH variables of the fine-tuning script.
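
Before fine-tuning, you can sanity-check a downloaded temporal-expert weight (a minimal sketch; the file name is hypothetical, and the 'model'/'module' key layout is typical of VideoMAE releases but may differ per checkpoint):

import torch

# Hypothetical file name; point this at your downloaded VideoMAE weight.
ckpt = torch.load("YOUR_PATH/videomae_pretrain.pth", map_location="cpu")
# Pre-training checkpoints usually wrap the weights under 'model' or 'module'.
state_dict = ckpt.get("model", ckpt.get("module", ckpt))
print(f"{len(state_dict)} tensors in the checkpoint")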

Fine-tuning CAST

We provide off-the-shelf scripts in the scripts folder.

DATA_PATH=YOUR_PATH
ANNOTATION_PATH=YOUR_PATH
VMAE_MODEL_PATH=YOUR_PATH
CLIP_MODEL_PATH=YOUR_PATH


OMP_NUM_THREADS=1 python -m torch.distributed.launch \
  --nproc_per_node=2 \
  --master_port ${YOUR_NUMBER} --nnodes=8 \
  --node_rank=${YOUR_NUMBER} --master_addr=${YOUR_ADDRESS} \
  YOUR_PATH/run_bidirection_compo.py \
  --data_set Kinetics-400 \
  --nb_classes 400 \
  --vmae_model compo_bidir_vit_base_patch16_224 \
  --anno_path ${ANNOTATION_PATH} \
  --data_path ${DATA_PATH} \
  --clip_finetune ${CLIP_MODEL_PATH} \
  --vmae_finetune ${VMAE_MODEL_PATH} \
  --log_dir ${YOUR_PATH} \
  --output_dir ${YOUR_PATH} \
  --batch_size 6 \
  --input_size 224 \
  --short_side_size 224 \
  --save_ckpt_freq 25 \
  --num_sample 1 \
  --num_frames 16 \
  --opt adamw \
  --lr 1e-3 \
  --opt_betas 0.9 0.999 \
  --weight_decay 0.05 \
  --epochs 70 \
  --dist_eval \
  --test_num_segment 5 \
  --test_num_crop 3 \
  --num_workers 8 \
  --drop_path 0.2 \
  --layer_decay 0.75 \
  --mixup_switch_prob 0 \
  --mixup_prob 0.5 \
  --reprob 0. \
  --init_scale 1. \
  --update_freq 6 \
  --seed 0 \
  --enable_deepspeed \
  --warmup_epochs 5
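
For reference, the effective global batch size implied by this script (2 processes per node x 8 nodes, i.e. the 16 GPUs noted in the installation section) works out as follows; a quick sanity-check sketch:

# Effective global batch size of the fine-tuning script above.
gpus = 2 * 8               # --nproc_per_node x --nnodes
batch_per_gpu = 6          # --batch_size
update_freq = 6            # --update_freq (gradient accumulation steps)
print(gpus * batch_per_gpu * update_freq)  # 576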

Evaluation

Evaluation command for EK100:

python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval

Evaluation command for SSV2 and K400:

python ./run_bidirection.py --fine_tune {YOUR_FINETUNED_WEIGHT} --eval
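
At test time, predictions are aggregated over test_num_segment x test_num_crop views per video; a small sketch of the resulting view counts (assuming the Clips and Crops columns in the Model Zoo below map to these two flags):

# Views per video at test time: temporal segments x spatial crops.
def num_views(segments: int, crops: int) -> int:
    return segments * crops

print(num_views(5, 3))  # K400 fine-tuning script above: 15 views per video
print(num_views(2, 3))  # EK100 / SSV2 Model Zoo setting: 6 views per video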

Model Zoo

EPIC-KITCHENS-100

| Method | Spatial Expert | Temporal Expert | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | 50 | 16x2x3 | log/checkpoint | 49.3 |

Something-Something V2

| Method | Spatial Expert | Temporal Expert | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on SSV2) | 50 | 16x2x3 | log/checkpoint | 71.6 |

Kinetics-400

| Method | Spatial Expert | Temporal Expert | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | 70 | 16x5x3 | log/checkpoint | 85.3 |

Acknowledgements

This project is built upon VideoMAE, MAE, CLIP and BEiT. Thanks to the contributors of these great codebases.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@inproceedings{cast,
  title={CAST: Cross-Attention in Space and Time for Video Action Recognition},
  author={Lee, Dongho and Lee, Jongseo and Choi, Jinwoo},
  booktitle={NeurIPS},
  year={2023}
}