CoaT: Co-Scale Conv-Attentional Image Transformers

Introduction

This repository contains the official code and pretrained models for CoaT: Co-Scale Conv-Attentional Image Transformers. It introduces (1) a co-scale mechanism to realize fine-to-coarse, coarse-to-fine and cross-scale attention modeling and (2) an efficient conv-attention module to realize relative position encoding in the factorized attention.

<img src="./figures/model-acc.svg" alt="Model Accuracy" width="600" />

For more details, please refer to CoaT: Co-Scale Conv-Attentional Image Transformers by Weijian Xu*, Yifan Xu*, Tyler Chang, and Zhuowen Tu.

Performance

  1. Classification (ImageNet dataset)

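    | Name | Acc@1 | Acc@5 | #Params |
    | --- | --- | --- | --- |
    | CoaT-Lite Tiny | 77.5 | 93.8 | 5.7M |
    | CoaT-Lite Mini | 79.1 | 94.5 | 11M |
    | CoaT-Lite Small | 81.9 | 95.5 | 20M |
    | CoaT-Lite Medium | 83.6 | 96.7 | 45M |
    | CoaT Tiny | 78.3 | 94.0 | 5.5M |
    | CoaT Mini | 81.0 | 95.2 | 10M |
    | CoaT Small | 82.1 | 96.1 | 22M |
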
  2. Instance Segmentation (Mask R-CNN w/ FPN on COCO dataset)

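    | Name | Schedule | Bbox AP | Segm AP |
    | --- | --- | --- | --- |
    | CoaT-Lite Mini | 1x | 41.4 | 38.0 |
    | CoaT-Lite Mini | 3x | 42.9 | 38.9 |
    | CoaT-Lite Small | 1x | 45.2 | 40.7 |
    | CoaT-Lite Small | 3x | 45.7 | 41.1 |
    | CoaT Mini | 1x | 45.1 | 40.6 |
    | CoaT Mini | 3x | 46.5 | 41.8 |
    | CoaT Small | 1x | 46.5 | 41.8 |
    | CoaT Small | 3x | 49.0 | 43.7 |
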
  3. Object Detection (Deformable-DETR on COCO dataset)

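    | Name | AP | AP50 | AP75 | APS | APM | APL |
    | --- | --- | --- | --- | --- | --- | --- |
    | CoaT-Lite Small | 47.0 | 66.5 | 51.2 | 28.8 | 50.3 | 63.3 |
    | CoaT Small | 48.4 | 68.5 | 52.4 | 30.1 | 51.8 | 63.8 |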

Changelog

12/12/2021: Code and pre-trained checkpoints for Deformable-DETR with CoaT Small backbone are released. <br />
12/07/2021: Training commands for CoaT-Lite Medium (384x384) are released. <br />
12/06/2021: Pre-trained checkpoints for CoaT-Lite Medium (384x384) are released. <br />
12/05/2021: Training scripts for CoaT Small and CoaT-Lite Medium are released. <br />
09/27/2021: Code and pre-trained checkpoints for instance segmentation with MMDetection are released. <br />
08/27/2021: Pre-trained checkpoints for CoaT Small and CoaT-Lite Medium are released. <br />
05/19/2021: Pre-trained checkpoints for Mask R-CNN benchmark with CoaT-Lite Small backbone are released. <br />
05/19/2021: Code and pre-trained checkpoints for Deformable-DETR with CoaT-Lite Small backbone are released. <br />
05/11/2021: Pre-trained checkpoints for CoaT-Lite Small are released. <br />
05/09/2021: Pre-trained checkpoints for Mask R-CNN benchmark with CoaT Mini backbone are released. <br />
05/06/2021: Pre-trained checkpoints for CoaT Mini are released. <br />
05/02/2021: Pre-trained checkpoints for CoaT Tiny are released. <br />
04/25/2021: Code and pre-trained checkpoints for Mask R-CNN benchmark with CoaT-Lite Mini backbone are released. <br />
04/23/2021: Pre-trained checkpoints for CoaT-Lite Mini are released. <br />
04/22/2021: Code and pre-trained checkpoints for CoaT-Lite Tiny are released.

Usage

The instructions below cover the classification task with the CoaT model. For other tasks, such as instance segmentation and object detection, please follow the corresponding README.

Environment Preparation

  1. Set up a new conda environment and activate it.

    # Create an environment with Python 3.8.
    conda create -n coat python==3.8
    conda activate coat
    
  2. Install required packages.

    # Install PyTorch 1.7.1 w/ CUDA 11.0.
    pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
    
    # Install timm 0.3.2.
    pip install timm==0.3.2
    
    # Install einops.
    pip install einops
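
Before moving on, it can help to confirm that the installed packages match the versions pinned above. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_versions` are hypothetical, and the comparison ignores local build tags such as `+cu110`.

```python
from importlib import metadata

# Version pins copied from the install commands above; local build tags such
# as "+cu110" are ignored when comparing.
EXPECTED = {"torch": "1.7.1", "torchvision": "0.8.2", "timm": "0.3.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Map each package name to (found_version, matches_pin)."""
    report = {}
    for package, pinned in expected.items():
        found = lookup(package)
        # Strip a local build tag ("1.7.1+cu110" -> "1.7.1") before comparing.
        base = found.split("+")[0] if found else None
        report[package] = (found, base == pinned)
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{package}: {found or 'not installed'} [{status}]")
```

Run it inside the activated `coat` environment; a `MISMATCH` line points at a package worth reinstalling with the pinned version.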
    

Code and Dataset Preparation

  1. Clone the repo.

    git clone https://github.com/mlpc-ucsd/CoaT
    cd CoaT
    
  2. Download ImageNet dataset (ILSVRC 2012) and extract.

    # Create dataset folder.
    mkdir -p ./data/ImageNet
    
    # Download the dataset (not shown here) and copy the files (assume the download path is in $DATASET_PATH).
    cp $DATASET_PATH/ILSVRC2012_img_train.tar $DATASET_PATH/ILSVRC2012_img_val.tar $DATASET_PATH/ILSVRC2012_devkit_t12.tar.gz ./data/ImageNet
    
    # Extract the dataset.
    python -c "from torchvision.datasets import ImageNet; ImageNet('./data/ImageNet', split='train')"
    python -c "from torchvision.datasets import ImageNet; ImageNet('./data/ImageNet', split='val')"
    # After the extraction, you should observe `train` and `val` folders under ./data/ImageNet.
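
To double-check the extraction, one option is to count the class folders in each split. This is a rough sketch rather than a repository script: the function names are hypothetical, and it assumes each split holds one subfolder per class, which for ILSVRC 2012 means 1000 folders under both `train` and `val`.

```python
from pathlib import Path

# ILSVRC 2012 has 1000 classes, so each split should hold 1000 class folders.
EXPECTED_CLASSES = 1000


def count_class_dirs(root, split):
    """Count the immediate subfolders of <root>/<split> (one per class)."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        return 0
    return sum(1 for entry in split_dir.iterdir() if entry.is_dir())


def check_layout(root, expected_classes=EXPECTED_CLASSES):
    """Return {split: True/False} for the `train` and `val` folders."""
    return {
        split: count_class_dirs(root, split) == expected_classes
        for split in ("train", "val")
    }


if __name__ == "__main__":
    for split, ok in check_layout("./data/ImageNet").items():
        print(f"{split}: {'ok' if ok else 'unexpected folder count'}")
```

The check only looks at folder counts, so it will not catch a partially extracted class folder.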
    

Evaluate Pre-trained Checkpoint

We provide the CoaT checkpoints pre-trained on the ImageNet dataset.

NameAcc@1Acc@5#ParamsSHA-256 (first 8 chars)URL
CoaT-Lite Tiny77.593.85.7Me88e96b0model, log
CoaT-Lite Mini79.194.511M6b4a8ae5model, log
CoaT-Lite Small81.995.520M8d362f48model, log
CoaT-Lite Medium83.696.745Ma750cd63model, log
CoaT-Lite Medium (384x384)84.597.145Mf9129688model, log
CoaT Tiny78.394.05.5Mc6efc33cmodel, log
CoaT Mini81.095.210M40667eecmodel, log
CoaT Small82.196.122M7479cf9bmodel, log

The following commands provide an example (CoaT-Lite Tiny) to evaluate the pre-trained checkpoint.

# Download the pretrained checkpoint.
mkdir -p ./output/pretrained
wget http://vcl.ucsd.edu/coat/pretrained/coat_lite_tiny_e88e96b0.pth -P ./output/pretrained
sha256sum ./output/pretrained/coat_lite_tiny_e88e96b0.pth  # Make sure it matches the SHA-256 hash (first 8 characters) in the table.

# Evaluate.
# Usage: bash ./scripts/eval.sh [model name] [output folder] [checkpoint path]
bash ./scripts/eval.sh coat_lite_tiny coat_lite_tiny_pretrained ./output/pretrained/coat_lite_tiny_e88e96b0.pth
# The final lines of the output should be similar to "Acc@1 77.504 Acc@5 93.814".
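
The `sha256sum` comparison above can also be automated. The sketch below is an illustration, not part of the repository: the helper names are hypothetical, and it assumes checkpoint files are named `<model>_<8 hex chars>.pth`, as in the table, so the expected prefix can be read from the file name.

```python
import hashlib
import re
from pathlib import Path


def sha256_prefix(path, length=8, chunk_size=1 << 20):
    """Return the first `length` hex characters of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large checkpoint is not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def matches_filename_hash(path):
    """Compare the digest prefix with the 8-hex-char suffix in the file name.

    Assumes names shaped like `<model>_<8 hex chars>.pth`.
    """
    match = re.search(r"_([0-9a-f]{8})\.pth$", Path(path).name)
    if match is None:
        raise ValueError(f"no 8-character hash suffix in {str(path)!r}")
    return sha256_prefix(path) == match.group(1)
```

For example, `matches_filename_hash("./output/pretrained/coat_lite_tiny_e88e96b0.pth")` should return `True` for an intact download. An 8-character prefix guards against truncated or corrupted files; it is not a substitute for comparing a full digest.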

Note: For CoaT-Lite Medium with 384x384 input, we use the following command for evaluation:

# Evaluation command for CoaT-Lite Medium (384x384).
bash ./scripts/eval_extra_args.sh coat_lite_medium coat_lite_medium_384x384_pretrained ./output/pretrained/coat_lite_medium_384x384_f9129688.pth --batch-size 128 --input-size 384

Train

The following commands provide an example (CoaT-Lite Tiny, 8-GPU) to train the CoaT model.

# Usage: bash ./scripts/train.sh [model name] [output folder]
bash ./scripts/train.sh coat_lite_tiny coat_lite_tiny

Note: Some training hyperparameters for CoaT Small and CoaT-Lite Medium are different from the default settings:

# Training command for CoaT Small.
bash ./scripts/train_extra_args.sh coat_small coat_small --batch-size 128 --drop-path 0.2 --no-model-ema --warmup-epochs 20 --clip-grad 5.0

# Training command for CoaT-Lite Medium.
bash ./scripts/train_extra_args.sh coat_lite_medium coat_lite_medium --batch-size 128 --drop-path 0.3 --no-model-ema --warmup-epochs 20 --clip-grad 5.0

# Training command for CoaT-Lite Medium (384x384).
bash ./scripts/train_extra_args.sh coat_lite_medium coat_lite_medium_384x384 \
   --resume ./output/pretrained/coat_lite_medium_a750cd63.pth \
   --resume_only_state \
   --batch-size 32 \
   --drop-path 0.2 \
   --no-model-ema \
   --warmup-epochs 0 \
   --clip-grad 5.0 \
   --input-size 384 \
   --lr 5e-6 \
   --min-lr 5e-6 \
   --weight-decay 1e-8 \
   --epochs 6 \
   --save_freq 1

Evaluate

The following commands provide an example (CoaT-Lite Tiny) to evaluate the checkpoint after training.

# Usage: bash ./scripts/eval.sh [model name] [output folder] [checkpoint path]
bash ./scripts/eval.sh coat_lite_tiny coat_lite_tiny_eval ./output/coat_lite_tiny/checkpoints/checkpoint0299.pth
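
If the index of the last saved checkpoint is not known in advance, a small helper can pick it out of the checkpoint folder. This is a sketch under one assumption: that checkpoints follow the `checkpointNNNN.pth` naming seen in the command above. The function `latest_checkpoint` is hypothetical and not provided by the repository.

```python
import re
from pathlib import Path

# Matches names such as "checkpoint0299.pth" and captures the numeric part.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint(\d+)\.pth$")


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered checkpoint, or None if none exist."""
    best = None
    for entry in Path(checkpoint_dir).glob("checkpoint*.pth"):
        match = CHECKPOINT_PATTERN.match(entry.name)
        if match is None:
            # Skip files without a numeric suffix, e.g. "checkpoint_best.pth".
            continue
        index = int(match.group(1))
        if best is None or index > best[0]:
            best = (index, entry)
    return None if best is None else best[1]


if __name__ == "__main__":
    print(latest_checkpoint("./output/coat_lite_tiny/checkpoints"))
```

The printed path can then be passed as the `[checkpoint path]` argument of `./scripts/eval.sh`.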

Citation

@InProceedings{Xu_2021_ICCV,
    author    = {Xu, Weijian and Xu, Yifan and Chang, Tyler and Tu, Zhuowen},
    title     = {Co-Scale Conv-Attentional Image Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {9981-9990}
}

License

This repository is released under the Apache License 2.0. The license can be found in the LICENSE file.

Acknowledgment

Thanks to DeiT and pytorch-image-models for a clear and data-efficient implementation of ViT. Thanks to lucidrains' implementation of Lambda Networks and CPVT.