Bootstrapping ViTs

Towards liberating vision Transformers from pre-training.

Official code for the paper *Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training*.

Authors: Haofei Zhang, Jiarui Duan, Mengqi Xue, Jie Song, Li Sun, Mingli Song

Results (Top-1 Accuracy, %)

1. CIFAR

| Model | Method | CIFAR-10 | CIFAR-100 |
| --- | --- | --- | --- |
| CNNs | EfficientNet-B2 <br> ResNet50 <br> Agent-S <br> Agent-B | 94.14 <br> 94.92 <br> 94.18 <br> 94.83 | 75.55 <br> 77.57 <br> 74.62 <br> 74.78 |
| ViTs | ViT-S <br> ViT-S-SAM <br> ViT-S-Sparse <br> ViT-B <br> ViT-B-SAM <br> ViT-B-Sparse | 87.32 <br> 87.77 <br> 87.43 <br> 79.24 <br> 86.57 <br> 83.87 | 61.25 <br> 62.60 <br> 62.29 <br> 53.07 <br> 58.18 <br> 57.22 |
| Pre-trained ViTs | ViT-S <br> ViT-B | 95.70 <br> 97.17 | 80.91 <br> 84.95 |
| Ours Joint | Agent-S <br> ViT-S <br> Agent-B <br> ViT-B | 94.90 <br> 95.14 <br> 95.06 <br> 95.00 | 74.06 <br> 76.19 <br> 76.57 <br> 77.83 |
| Ours Shared | Agent-S <br> ViT-S <br> Agent-B <br> ViT-B | 93.22 <br> 93.72 <br> 92.66 <br> 93.34 | 74.06 <br> 75.50 <br> 74.11 <br> 75.71 |

2. ImageNet

| Method | 5% images | 10% images | 50% images |
| --- | --- | --- | --- |
| ResNet50 <br> Agent-B | 35.43 <br> 35.28 | 50.86 <br> 47.46 | 70.05 <br> 68.13 |
| ViT-B <br> ViT-B-SAM <br> ViT-B-Sparse | 16.60 <br> 16.67 <br> 10.39 | 28.11 <br> 28.66 <br> 28.92 | 63.40 <br> 64.37 <br> 66.01 |
| Ours-Joint <br> Ours-Shared | 36.01 <br> 33.06 | 49.73 <br> 45.75 | 71.36 <br> 66.48 |

Quick Start

1. Prepare dataset
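
A minimal download sketch, assuming the configs point at standard torchvision CIFAR folders (the data root below is a placeholder; set it to whatever path the YAML files under config/ expect). ImageNet must be downloaded manually from image-net.org.

```bash
# Download CIFAR-10/100 into ./data (the path is an assumption; adjust it to
# match the dataset root configured in the YAML files under config/)
mkdir -p data
python -c "import torchvision; torchvision.datasets.CIFAR100('data/cifar100', download=True)"
python -c "import torchvision; torchvision.datasets.CIFAR10('data/cifar10', download=True)"
```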

2. Prepare cv-lib-PyTorch

Our code requires cv-lib-PyTorch. Download that repo and check out the tag bootstrapping_vits.

cv-lib-PyTorch is an open-source repo currently maintained by the authors.
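
A minimal setup sketch; the clone URL is a placeholder for the actual cv-lib-PyTorch repository:

```bash
# Clone cv-lib-PyTorch next to this project and check out the required tag
git clone <cv-lib-PyTorch-repo-url> cv-lib-PyTorch
cd cv-lib-PyTorch
git checkout bootstrapping_vits
```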

3. Requirements
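
Exact package versions are not pinned here. A minimal environment sketch, with the package list assumed from the training scripts (PyTorch for dist_engine.py, YAML configs) rather than taken from an official requirements file:

```bash
# Assumed dependencies; versions are illustrative, not pinned by the authors
pip install torch torchvision pyyaml tensorboard
```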

4. Train from scratch

In the config directory, we provide several training configurations, including CIFAR-100 and ImageNet-10%. The following script starts training Agent-S (agent-small) from scratch on CIFAR-100.

To train with the SAM optimizer, set the --worker option to sam_train_worker (see the sketch after the launch script below).

```bash
export PYTHONPATH=/path/to/cv-lib-PyTorch
export CUDA_VISIBLE_DEVICES=0,1

port=9872
python dist_engine.py \
    --num-nodes 1 \
    --rank 0 \
    --master-url tcp://localhost:${port} \
    --backend nccl \
    --multiprocessing \
    --file-name-cfg cls \
    --cfg-filepath config/cifar100/cnn/agent-small.yaml \
    --log-dir run/cifar100/cnn/agent-small \
    --worker worker
```
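
For the SAM variant, only the worker changes. The sketch below assumes the same shell session as above (so PYTHONPATH, CUDA_VISIBLE_DEVICES, and port are already set); the ViT config path and log dir are assumed examples for illustration, not verified files in the repo:

```bash
# SAM training: same launch command, but with the SAM training worker
# (the --cfg-filepath and --log-dir values here are assumptions)
python dist_engine.py \
    --num-nodes 1 \
    --rank 0 \
    --master-url tcp://localhost:${port} \
    --backend nccl \
    --multiprocessing \
    --file-name-cfg cls \
    --cfg-filepath config/cifar100/vit/vit-small.yaml \
    --log-dir run/cifar100/vit/vit-small-sam \
    --worker sam_train_worker
```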

5. Ours Joint

```bash
export PYTHONPATH=/path/to/project/cv-lib-PyTorch
export CUDA_VISIBLE_DEVICES=0,1

port=9873
python dist_engine.py \
    --num-nodes 1 \
    --rank 0 \
    --master-url tcp://localhost:${port} \
    --backend nccl \
    --multiprocessing \
    --file-name-cfg joint \
    --cfg-filepath config/cifar100/joint/agent-small-vit-small.yaml \
    --log-dir run/cifar100/joint/agent-small-vit-small \
    --use-amp \
    --worker mutual_worker
```

6. Ours Shared

```bash
export PYTHONPATH=/path/to/project/cv-lib-PyTorch
export CUDA_VISIBLE_DEVICES=0,1

port=9873
python dist_engine.py \
    --num-nodes 1 \
    --rank 0 \
    --master-url tcp://localhost:${port} \
    --backend nccl \
    --multiprocessing \
    --file-name-cfg shared \
    --cfg-filepath config/cifar100/shared/agent-base-res_like-vit-base.yaml \
    --log-dir run/cifar100/shared/agent-base-res_like-vit-base \
    --use-amp \
    --worker mutual_worker
```

After training, the reported accuracy is that of the final epoch rather than the best epoch.
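
Logs and checkpoints are written under the --log-dir path. Assuming the workers also emit TensorBoard summaries (an assumption, not confirmed here), training curves can be inspected with:

```bash
# View training curves, assuming event files are written under run/
tensorboard --logdir run
```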

Citation

If you find this work useful for your research, please cite our paper:

```bibtex
@article{zhang2021bootstrapping,
  title={Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training},
  author={Zhang, Haofei and Duan, Jiarui and Xue, Mengqi and Song, Jie and Sun, Li and Song, Mingli},
  journal={arXiv preprint arXiv:2112.03552},
  year={2021}
}
```