This repo is used to extract SwinBERT features for our TIP paper:

Concept-Aware Video Captioning: Describing Videos With Effective Prior Information

Bang Yang, Meng Cao and Yuexian Zou*.

[IEEE Xplore], [Github]

Here are the instructions:

  1. Download the fine-tuned MSVD (test split CIDEr: 120.6) and MSRVTT (test split CIDEr: 53.8) checkpoints from the SwinBERT repo.
  2. Convert SwinBERT checkpoints:
python convert_swinbert.py ./models/table1/msvd/best-checkpoint/model.bin swinbert_msvd.pth

python convert_swinbert.py ./models/table1/msrvtt/best-checkpoint/model.bin swinbert_msrvtt.pth
  3. Extract features:
config=configs/recognition/swin/swin_base_patch244_window877_kinetics600_22k.py

MSVD_root=/data/video_datasets/MSVD

python extract_features.py \
$config \
swinbert_msvd.pth \
--batch_size 4 \
--num_workers 8 \
--img_fn_format image_%05d.jpg \
--dense \
--video_root $MSVD_root/all_frames \
--save_path $MSVD_root/feats/motion_swinbert_kinetics_cliplen64_dense.hdf5 


MSRVTT_root=/data/video_datasets/MSRVTT
python extract_features.py \
$config \
swinbert_msrvtt.pth \
--batch_size 4 \
--num_workers 8 \
--img_fn_format image_%05d.jpg \
--dense \
--video_root $MSRVTT_root/all_frames \
--save_path $MSRVTT_root/feats/motion_swinbert_kinetics_cliplen64_dense.hdf5
  4. Now you can use the SwinBERTDense features in yangbang18/CARE.git; a minimal loading sketch for the extracted HDF5 files is shown below.
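To sanity-check the extracted features before plugging them into CARE, here is a minimal loading sketch with h5py. It assumes each video ID maps to one dataset inside the HDF5 file; the exact key format is determined by extract_features.py, so adjust accordingly.

import h5py

# Path produced by the MSVD command above; swap in the MSRVTT path as needed.
path = "/data/video_datasets/MSVD/feats/motion_swinbert_kinetics_cliplen64_dense.hdf5"
with h5py.File(path, "r") as f:
    video_ids = list(f.keys())    # assumption: one dataset per video ID
    feats = f[video_ids[0]][:]    # per-clip SwinBERT features as a numpy array
    print(video_ids[0], feats.shape)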

Video Swin Transformer


By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer is initially described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
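For intuition, the locality bias can be illustrated with a small self-contained sketch (not code from this repo): the video feature map is partitioned into non-overlapping 3D windows, and self-attention is then computed only within each window rather than over all space-time tokens.

import torch

def window_partition_3d(x, window_size):
    # x: (B, D, H, W, C) video features; window_size: (Wd, Wh, Ww).
    # Returns (num_windows * B, Wd*Wh*Ww, C), i.e. tokens grouped per local
    # window, so attention over the token dimension stays local.
    B, D, H, W, C = x.shape
    Wd, Wh, Ww = window_size
    x = x.view(B, D // Wd, Wd, H // Wh, Wh, W // Ww, Ww, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous().view(-1, Wd * Wh * Ww, C)

# Example: 2 clips, 8 frames, a 56x56 spatial grid, 96 channels, 8x7x7 windows.
x = torch.randn(2, 8, 56, 56, 96)
windows = window_partition_3d(x, (8, 7, 7))
print(windows.shape)  # torch.Size([128, 392, 96])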


Results and Models

Kinetics 400

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1K | 30ep | 224 | 78.8 | 93.6 | 28M | 87.9G | config | github/baidu |
| Swin-S | ImageNet-1K | 30ep | 224 | 80.6 | 94.5 | 50M | 165.9G | config | github/baidu |
| Swin-B | ImageNet-1K | 30ep | 224 | 80.6 | 94.6 | 88M | 281.6G | config | github/baidu |
| Swin-B | ImageNet-22K | 30ep | 224 | 82.7 | 95.5 | 88M | 281.6G | config | github/baidu |

Kinetics 600

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-B | ImageNet-22K | 30ep | 224 | 84.0 | 96.5 | 88M | 281.6G | config | github/baidu |

Something-Something V2

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-B | Kinetics 400 | 60ep | 224 | 69.6 | 92.7 | 89M | 320.6G | config | github/baidu |

Notes:

Usage

Installation

Please refer to install.md for installation.

We also provide Docker files for cuda10.1 (image url) and cuda11.0 (image url) for convenient usage.

Data Preparation

Please refer to data_preparation.md for general guidance on data preparation. The supported datasets are listed in supported_datasets.md.

We also share our Kinetics-400 annotation files k400_val and k400_train for better comparison.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy

Training

To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> 

To train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.
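For reference, the same switch can also be set directly in a config file instead of via --cfg-options. The snippet below is only a sketch following the mmaction2 config convention; use_checkpoint is the point here, and the remaining backbone fields are assumed to come from the base config.

# Sketch: enable activation checkpointing in a config file (equivalent to
# passing model.backbone.use_checkpoint=True on the command line).
model = dict(
    backbone=dict(
        use_checkpoint=True,  # recompute activations in backward to save GPU memory
    ),
)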

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
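If you do comment it out, the training job still needs an optimizer hook; a plain mmcv-style fallback could look like the sketch below. This is an assumption on our part rather than a block from this repo, and native fp16 support depends on your mmcv/mmaction2 version.

# Hypothetical fallback when the apex-based DistOptimizerHook above is disabled.
optimizer_config = dict(grad_clip=None)
# fp16 = dict(loss_scale=512.)  # optionally use mmcv's native fp16 hook instead of apex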

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Other Links

Image Classification: See Swin Transformer for Image Classification.

Object Detection: See Swin Transformer for Object Detection.

Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Self-Supervised Learning: See MoBY with Swin Transformer.