<img src="https://github.com/jiawangbai/HAT/blob/main/misc/eccv.png" width="200" height="100"/><br/>

HAT

Implementation of HAT https://arxiv.org/pdf/2204.00993

@inproceedings{bai2022improving,
  title={Improving Vision Transformers by Revisiting High-frequency Components},
  author={Bai, Jiawang and Yuan, Li and Xia, Shu-Tao and Yan, Shuicheng and Li, Zhifeng and Liu, Wei},
  booktitle={European Conference on Computer Vision},
  year={2022}
}

Requirements

torch>=1.7.0
torchvision>=0.8.0
timm==0.4.5
tlt==0.1.0
pyyaml
apex-amp

ImageNet Classification

Data Preparation

We use the ImageNet-1K training and validation datasets by default. Please save them in [your_imagenet_path].
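Before launching a long run, it can help to confirm that the dataset folder is laid out as the loader expects. The sketch below is a minimal check, assuming a timm-style folder layout with `train` and `val` subdirectories that each hold one directory per class. The split names and the helper `count_class_dirs` are assumptions for illustration, not part of this repository.

```python
import tempfile
from pathlib import Path


def count_class_dirs(root, splits=("train", "val")):
    """Return {split: number of class subdirectories} under root.

    Assumes a layout like root/train/<class>/ and root/val/<class>/,
    which is an assumption and is not stated in this README.
    """
    root = Path(root)
    counts = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        # Count only directories; stray files at this level are ignored.
        counts[split] = sum(1 for p in split_dir.iterdir() if p.is_dir())
    return counts


# Demo on a throwaway tree; point the function at [your_imagenet_path] instead.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "val"):
        for wnid in ("n01440764", "n01443537"):
            (Path(tmp) / split / wnid).mkdir(parents=True)
    demo_counts = count_class_dirs(tmp)

print(demo_counts)
```

For the full ImageNet-1K dataset, both splits would be expected to report 1000 class directories under this assumed layout.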

Training

To train ViT models with HAT on 8 GPUs using the default settings from our paper:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
--data_dir [your_imagenet_path] \
--model [your_vit_model_name] \
--adv-epochs 200 \
--adv-iters 3 \
--adv-eps 0.00784314 \
--adv-kl-weight 0.01 \
--adv-ce-weight 3.0 \
--output [your_output_path] \
and_other_parameters_specified_for_your_vit_models...

For instance, we train Swin-T with the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
--data_dir [your_imagenet_path] \
--model swin_tiny_patch4_window7_224 \
--adv-epochs 200 \
--adv-iters 3 \
--adv-eps 0.00784314 \
--adv-kl-weight 0.01 \
--adv-ce-weight 3.0 \
--output [your_output_path] \
--batch-size 256 \
--drop-path 0.2 \
--lr 1e-3 \
--weight-decay 0.05 \
--clip-grad 1.0
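The value passed to `--adv-eps` is easier to interpret once converted to 8-bit intensity levels. The short calculation below is a sketch that assumes the budget is expressed on a [0, 1] pixel scale (an assumption, not stated in this README). Under that reading, 0.00784314 corresponds to 2/255. The helper `eps_in_8bit_levels` is hypothetical.

```python
# Assumption: --adv-eps is given on a [0, 1] pixel scale.
ADV_EPS = 0.00784314  # value passed to --adv-eps in the commands above


def eps_in_8bit_levels(eps: float) -> float:
    """Convert a [0, 1]-scale perturbation budget to 8-bit intensity levels."""
    return eps * 255


levels = eps_in_8bit_levels(ADV_EPS)
print(f"{ADV_EPS} on a [0, 1] scale is about {levels:.4f} of 255 levels")
print(f"2/255 rounded to 8 decimals: {2 / 255:.8f}")
```

To try a different budget under the same assumption, for example 4/255, the corresponding flag value would be `4 / 255`, roughly 0.01568627.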

For training variants of ViT, Swin Transformer, and VOLO, we use the hyper-parameters in [3], [4], and [2], respectively.

We also combine HAT with the knowledge distillation in [5]; use train_kd.py for this setting.

Validation

After training, we can use validate.py to evaluate the ViT model trained with HAT.

For instance, we evaluate Swin-T with the following command:

python3 -u validate.py \
--data_dir [your_imagenet_path] \
--model swin_tiny_patch4_window7_224 \
--checkpoint [your_checkpoint_path] \
--batch-size 128 \
--num-gpu 8 \
--apex-amp \
--results-file [your_results_file_path]

Results

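| Model | Params | FLOPs | Test Size | Top-1 | +HAT Top-1 | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-T | 5.7M | 1.6G | 224 | 72.2 | 73.3 | link |
| ViT-S | 22.1M | 4.7G | 224 | 80.1 | 80.9 | link |
| ViT-B | 86.6M | 17.6G | 224 | 82.0 | 83.2 | link |
| Swin-T | 28.3M | 4.5G | 224 | 81.2 | 82.0 | link |
| Swin-S | 49.6M | 8.7G | 224 | 83.0 | 83.3 | link |
| Swin-B | 87.8M | 15.4G | 224 | 83.5 | 84.0 | link |
| VOLO-D1 | 26.6M | 6.8G | 224 | 84.2 | 84.5 | link |
| VOLO-D1 | 26.6M | 22.8G | 384 | 85.2 | 85.5 | link |
| VOLO-D5 | 295.5M | 69.0G | 224 | 86.1 | 86.3 | link |
| VOLO-D5 | 295.5M | 304G | 448 | 87.0 | 87.2 | link |
| VOLO-D5 | 295.5M | 412G | 512 | 87.1 | 87.3 | link |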
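The per-model improvement is the difference between the two Top-1 columns. The sketch below recomputes those gains from the numbers reported in the table. The tuple layout and the helper `top1_gains` are choices made here for illustration only.

```python
# (model, test size, baseline Top-1, +HAT Top-1), copied from the results table.
RESULTS = [
    ("ViT-T", 224, 72.2, 73.3),
    ("ViT-S", 224, 80.1, 80.9),
    ("ViT-B", 224, 82.0, 83.2),
    ("Swin-T", 224, 81.2, 82.0),
    ("Swin-S", 224, 83.0, 83.3),
    ("Swin-B", 224, 83.5, 84.0),
    ("VOLO-D1", 224, 84.2, 84.5),
    ("VOLO-D1", 384, 85.2, 85.5),
    ("VOLO-D5", 224, 86.1, 86.3),
    ("VOLO-D5", 448, 87.0, 87.2),
    ("VOLO-D5", 512, 87.1, 87.3),
]


def top1_gains(rows):
    """Return (model, test size, gain) with the gain rounded to one decimal."""
    return [(model, size, round(hat - base, 1)) for model, size, base, hat in rows]


gains = top1_gains(RESULTS)
for model, size, gain in gains:
    print(f"{model} @ {size}: +{gain}")
```

In these numbers the gains range from +0.2 (VOLO-D5) to +1.2 (ViT-B) percentage points.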

Combining HAT with the knowledge distillation in [5] yields 84.3% Top-1 for ViT-B; the model can be downloaded here.

Downstream Tasks

We first pretrain Swin-T/S/B on the ImageNet-1K dataset with our proposed HAT, and then transfer the models to downstream tasks: object detection, instance segmentation, and semantic segmentation.

We use the code from Swin Transformer for Object Detection and Swin Transformer for Semantic Segmentation, and follow their configurations.

Cascade Mask R-CNN on COCO val 2017

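| Backbone | Params | FLOPs | Config | AP_box | +HAT AP_box | AP_mask | +HAT AP_mask |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | 86M | 745G | config | 50.5 | 50.9 | 43.7 | 43.9 |
| Swin-S | 107M | 838G | config | 51.8 | 52.5 | 44.7 | 45.4 |
| Swin-B | 145M | 982G | config | 51.9 | 52.8 | 45.0 | 45.6 |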

UperNet on ADE20K

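| Backbone | Params | FLOPs | Config | mIoU(MS) | +HAT mIoU(MS) |
| --- | --- | --- | --- | --- | --- |
| Swin-T | 60M | 945G | config | 46.1 | 46.7 |
| Swin-S | 81M | 1038G | config | 49.5 | 49.7 |
| Swin-B | 121M | 1088G | config | 49.7 | 50.3 |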

[1] Wightman, R. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[2] Yuan, L. et al. VOLO: Vision outlooker for visual recognition. arXiv, 2021.
[3] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[4] Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
[5] Touvron, H. et al. Training data-efficient image transformers & distillation through attention. ICML, 2021.