<img src="https://github.com/jiawangbai/HAT/blob/main/misc/eccv.png" width="200" height="100"/><br/>
# HAT

Implementation of HAT ("Improving Vision Transformers by Revisiting High-frequency Components", ECCV 2022): https://arxiv.org/pdf/2204.00993

If you find this work useful, please cite:
```bibtex
@inproceedings{bai2022improving,
  title={Improving Vision Transformers by Revisiting High-frequency Components},
  author={Bai, Jiawang and Yuan, Li and Xia, Shu-Tao and Yan, Shuicheng and Li, Zhifeng and Liu, Wei},
  booktitle={European Conference on Computer Vision},
  year={2022}
}
```
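The paper studies the role of high-frequency image components in ViT training. For intuition about what the "high-frequency components" of an image are, the sketch below (not code from this repository) splits an image batch into low- and high-frequency parts with an FFT low-pass mask; the function name `split_frequencies` and the `radius_ratio` cutoff are illustrative choices only, and `torch.fft.fft2`/`fftshift` require torch >= 1.8.

```python
import torch

def split_frequencies(img: torch.Tensor, radius_ratio: float = 0.25):
    """Split a (B, C, H, W) image batch into low- and high-frequency parts.

    A centered circular mask in the shifted 2D spectrum keeps the lowest
    frequencies; everything outside the mask is the high-frequency residual.
    """
    _, _, H, W = img.shape
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))

    # Distance of every frequency bin from the spectrum centre.
    yy = (torch.arange(H, device=img.device, dtype=img.dtype) - H / 2).view(H, 1)
    xx = (torch.arange(W, device=img.device, dtype=img.dtype) - W / 2).view(1, W)
    mask = ((yy ** 2 + xx ** 2).sqrt() <= radius_ratio * min(H, W) / 2).to(img.dtype)

    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    high = img - low  # high-frequency component is the residual
    return low, high

low, high = split_frequencies(torch.rand(2, 3, 224, 224))
```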
## Requirements

```
torch>=1.7.0
torchvision>=0.8.0
timm==0.4.5
tlt==0.1.0
pyyaml
apex-amp
```
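A quick, optional sanity check (not part of the repository) that the pinned packages above are importable and match the expected versions:

```python
# Optional environment check against the pinned requirements above.
import torch
import torchvision
import timm

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)
assert timm.__version__ == "0.4.5", "this repository pins timm==0.4.5"

try:
    from apex import amp  # noqa: F401  (needed for the --apex-amp flag)
except ImportError:
    print("NVIDIA apex not found; install it if you plan to use --apex-amp")
```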
## ImageNet Classification

### Data Preparation

We use the ImageNet-1K training and validation datasets by default. Please save them in `[your_imagenet_path]`.
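The training scripts build on timm [1], which typically expects the usual class-subfolder layout (one directory per class under each split). The snippet below is only a hypothetical sanity check of that layout; consult the repository's data loading code for the exact split folder names it expects.

```python
# Hypothetical check of the expected ImageNet layout:
#   [your_imagenet_path]/train/<class_name>/*.JPEG
#   [your_imagenet_path]/val/<class_name>/*.JPEG
# (the exact split folder names are determined by the training script).
from torchvision.datasets import ImageFolder

train_set = ImageFolder("/path/to/imagenet/train")
val_set = ImageFolder("/path/to/imagenet/val")
print(len(train_set), len(val_set), len(train_set.classes))  # ~1.28M, 50000, 1000
```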
### Training

Train ViT models with HAT on 8 GPUs, using the default settings from our paper:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
    --data_dir [your_imagenet_path] \
    --model [your_vit_model_name] \
    --adv-epochs 200 \
    --adv-iters 3 \
    --adv-eps 0.00784314 \
    --adv-kl-weight 0.01 \
    --adv-ce-weight 3.0 \
    --output [your_output_path] \
    and_other_parameters_specified_for_your_vit_models...
```
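The `--adv-*` flags suggest an adversarial augmentation with an inner loop of `--adv-iters` steps inside an L∞ ball of radius `--adv-eps` (0.00784314 ≈ 2/255), and a training objective that weights a cross-entropy term and a KL term (`--adv-ce-weight`, `--adv-kl-weight`). The sketch below is only a generic illustration of such a loop, not the repository's HAT implementation: in particular, HAT targets the high-frequency components of the images (see the paper), which this sketch omits, and the exact loss composition is defined in `train.py`. The function name `adversarial_step` is hypothetical.

```python
import torch
import torch.nn.functional as F

def adversarial_step(model, images, targets,
                     adv_iters=3, adv_eps=2 / 255,
                     adv_ce_weight=3.0, adv_kl_weight=0.01):
    """Generic L-inf adversarial augmentation with a weighted CE/KL objective."""
    # Clean predictions, used as the reference for the KL term.
    with torch.no_grad():
        clean_logits = model(images)

    # Inner loop: maximise CE w.r.t. a bounded perturbation (PGD-style).
    delta = torch.empty_like(images).uniform_(-adv_eps, adv_eps).requires_grad_(True)
    step_size = adv_eps / adv_iters
    for _ in range(adv_iters):
        loss = F.cross_entropy(model(images + delta), targets)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + step_size * grad.sign()).clamp(-adv_eps, adv_eps)
        delta = delta.detach().requires_grad_(True)

    # Outer objective: CE on the perturbed images plus a KL term tying the
    # perturbed predictions back to the clean ones (weights mirror the flags).
    adv_logits = model(images + delta.detach())
    ce = F.cross_entropy(adv_logits, targets)
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits, dim=-1), reduction="batchmean")
    return adv_ce_weight * ce + adv_kl_weight * kl
```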
For instance, we train Swin-T with the following command:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
    --data_dir [your_imagenet_path] \
    --model swin_tiny_patch4_window7_224 \
    --adv-epochs 200 \
    --adv-iters 3 \
    --adv-eps 0.00784314 \
    --adv-kl-weight 0.01 \
    --adv-ce-weight 3.0 \
    --output [your_output_path] \
    --batch-size 256 \
    --drop-path 0.2 \
    --lr 1e-3 \
    --weight-decay 0.05 \
    --clip-grad 1.0
```
For training the variants of ViT, Swin Transformer, and VOLO, we use the hyper-parameters reported in [3], [4], and [2], respectively.

We also combine HAT with knowledge distillation [5], using `train_kd.py`.
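For reference, here is a generic soft-label knowledge-distillation loss (temperature-scaled KL between student and teacher logits plus CE on the labels). It is only a sketch; the actual recipe used by `train_kd.py` (e.g. DeiT-style distillation [5], possibly with a distillation token and hard teacher labels) may differ.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=1.0, alpha=0.5):
    """Generic knowledge-distillation loss: alpha * soft + (1 - alpha) * hard."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```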
### Validation

After training, we can use `validate.py` to evaluate the ViT model trained with HAT.

For instance, we evaluate Swin-T with the following command:
```bash
python3 -u validate.py \
    --data_dir [your_imagenet_path] \
    --model swin_tiny_patch4_window7_224 \
    --checkpoint [your_checkpoint_path] \
    --batch-size 128 \
    --num-gpu 8 \
    --apex-amp \
    --results-file [your_results_file_path]
```
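Beyond the full validation run, a trained checkpoint can also be loaded directly for quick single-image inference. The snippet below is a hypothetical example: the file names are placeholders, and it assumes the model name is registered in your timm installation (this repository may register its own model variants) and that the checkpoint stores its weights under a `state_dict` key, as timm-style checkpoints usually do.

```python
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Build the architecture and load the HAT-trained weights (paths are placeholders).
model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=1000)
ckpt = torch.load("hat_swin_tiny.pth.tar", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
model.load_state_dict(state_dict)
model.eval()

# Preprocess one image with the model's default eval transform and predict top-5.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)
img = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    top5 = model(img).softmax(dim=-1).topk(5)
print(top5.indices.tolist(), top5.values.tolist())
```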
## Results

Model | Params | FLOPs | Test Size | Top-1 (%) | +HAT Top-1 (%) | Download |
---|---|---|---|---|---|---|
ViT-T | 5.7M | 1.6G | 224 | 72.2 | 73.3 | link |
ViT-S | 22.1M | 4.7G | 224 | 80.1 | 80.9 | link |
ViT-B | 86.6M | 17.6G | 224 | 82.0 | 83.2 | link |
Swin-T | 28.3M | 4.5G | 224 | 81.2 | 82.0 | link |
Swin-S | 49.6M | 8.7G | 224 | 83.0 | 83.3 | link |
Swin-B | 87.8M | 15.4G | 224 | 83.5 | 84.0 | link |
VOLO-D1 | 26.6M | 6.8G | 224 | 84.2 | 84.5 | link |
VOLO-D1 | 26.6M | 22.8G | 384 | 85.2 | 85.5 | link |
VOLO-D5 | 295.5M | 69.0G | 224 | 86.1 | 86.3 | link |
VOLO-D5 | 295.5M | 304G | 448 | 87.0 | 87.2 | link |
VOLO-D5 | 295.5M | 412G | 512 | 87.1 | 87.3 | link |
Combining HAT with the knowledge distillation method in [5] gives 84.3% top-1 accuracy for ViT-B; the model can be downloaded here.
## Downstream Tasks

We first pretrain Swin-T/S/B on ImageNet-1K with our proposed HAT, and then transfer the models to downstream tasks, including object detection, instance segmentation, and semantic segmentation.

We use the code from Swin Transformer for Object Detection and Swin Transformer for Semantic Segmentation, and follow their configurations.

### Cascade Mask R-CNN on COCO val 2017

Backbone | Params | FLOPs | Config | AP_box | +HAT AP_box | AP_mask | +HAT AP_mask |
---|---|---|---|---|---|---|---|
Swin-T | 86M | 745G | config | 50.5 | 50.9 | 43.7 | 43.9 |
Swin-S | 107M | 838G | config | 51.8 | 52.5 | 44.7 | 45.4 |
Swin-B | 145M | 982G | config | 51.9 | 52.8 | 45.0 | 45.6 |
### UperNet on ADE20K

Backbone | Params | FLOPs | Config | mIoU(MS) | +HAT mIoU(MS) |
---|---|---|---|---|---|
Swin-T | 60M | 945G | config | 46.1 | 46.7 |
Swin-S | 81M | 1038G | config | 49.5 | 49.7 |
Swin-B | 121M | 1088G | config | 49.7 | 50.3 |
[1] Wightman, R. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models, 2019.
[2] Yuan, L. et al. VOLO: Vision outlooker for visual recognition. arXiv, 2021.
[3] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[4] Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
[5] Touvron, H. et al. Training data-efficient image transformers & distillation through attention. ICML, 2021.