Top-Down Visual Attention from Analysis by Synthesis

This is the official codebase of AbSViT, from the following paper:

Top-Down Visual Attention from Analysis by Synthesis, CVPR 2023
Baifeng Shi, Trevor Darrell, and Xin Wang
UC Berkeley, Microsoft Research

Website | Paper

To-Dos

Finetuning on Vision-Language datasets

Environment

Install PyTorch 1.7.0+ and torchvision 0.8.1+ from the official website.

requirements.txt lists all the dependencies:

pip install -r requirements.txt

In addition, install the MagickWand library:

apt-get install libmagickwand-dev

Demo

ImageNet demo: demo/demo.ipynb visualizes AbSViT's attention map on single-object and multi-object images from ImageNet. Because the model is trained only on single-object recognition, its top-down attention is quite weak in this demo.

VQA demo: vision_language/demo/visualize_attention.ipynb shows how AbSViT's top-down attention adapts to different questions about the same image.

Model Zoo

Name	ImageNet	ImageNet-C (↓)	PASCAL VOC	Cityscapes	ADE20K	Weights
ViT-Ti	72.5	71.1	-	-	-	model
AbSViT-Ti	74.1	66.7	-	-	-	model
ViT-S	80.1	54.6	-	-	-	model
AbSViT-S	80.7	51.6	-	-	-	model
ViT-B	80.8	49.3	80.1	75.3	45.2	model
AbSViT-B	81.0	48.3	81.3	76.8	47.2	model

Evaluation on Image Classification

For example, to evaluate AbSViT_small on ImageNet, run

python main.py --model absvit_small_patch16_224 --data-path path/to/imagenet --eval --resume path/to/checkpoint

To evaluate on a robustness benchmark, add one of --inc_path /path/to/imagenet-c, --ina_path /path/to/imagenet-a, --inr_path /path/to/imagenet-r, or --insk_path /path/to/imagenet-sketch to test on ImageNet-C, ImageNet-A, ImageNet-R, or ImageNet-Sketch, respectively.

To test accuracy under adversarial attacks, add --fgsm_test or --pgd_test.
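
As a rough sketch of how the flags above fit together, the following dry-run script prints one evaluation command per benchmark or attack instead of executing anything. The helper name eval_cmd is hypothetical, the paths are the placeholders used in this README, and the script assumes each flag is simply appended to the base evaluation command; whether several of these flags can be combined in one run is not stated here, so it uses one per command.

```shell
#!/usr/bin/env bash
# Dry run: prints evaluation commands instead of executing them.
# All paths are README placeholders; replace them before use.
MODEL="absvit_small_patch16_224"
DATA="path/to/imagenet"
CKPT="path/to/checkpoint"

# eval_cmd EXTRA_ARGS... (hypothetical helper)
# Echoes the base evaluation command followed by the extra flags.
eval_cmd() {
  echo "python main.py --model $MODEL --data-path $DATA --eval --resume $CKPT $*"
}

# Robustness benchmarks: one benchmark flag per command.
eval_cmd --inc_path /path/to/imagenet-c
eval_cmd --ina_path /path/to/imagenet-a
eval_cmd --inr_path /path/to/imagenet-r
eval_cmd --insk_path /path/to/imagenet-sketch

# Adversarial attacks: one attack flag per command.
eval_cmd --fgsm_test
eval_cmd --pgd_test
```

To run an evaluation for real, copy the printed line you need into your shell once the paths point at your data and checkpoint.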

Evaluation on Semantic Segmentation

Please see segmentation for instructions.

Training

Take AbSViT_small as an example. We train on a single node with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --master_port 12345 main.py --model absvit_small_patch16_224 --data-path path/to/imagenet --output_dir output/here --num_workers 8 --batch-size 128 --warmup-epochs 10

To train a different model architecture, change the --model argument. We provide ViT_{tiny, small, base} and AbSViT_{tiny, small, base}.
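
One way to switch architectures without retyping the full launch line is to wrap the training command above in a small function that takes the --model value as its argument. This is only a sketch: train_cmd is a hypothetical helper, it prints the command rather than running it, and the only model identifier shown is the one that appears in this README. The exact identifiers for the other variants should be taken from the codebase, and whether the remaining hyperparameters need adjusting for a different GPU count is not covered here.

```shell
#!/usr/bin/env bash
# Dry run: prints the training command for a given --model value.
# train_cmd is a hypothetical helper; the flags are copied from the README command.
train_cmd() {
  local model="$1"        # value passed to --model
  local gpus="${2:-8}"    # processes per node; defaults to the 8 GPUs used in the README
  echo "python -m torch.distributed.launch --nproc_per_node=$gpus --master_port 12345 main.py" \
       "--model $model --data-path path/to/imagenet --output_dir output/here" \
       "--num_workers 8 --batch-size 128 --warmup-epochs 10"
}

# Print the README's AbSViT_small training command.
train_cmd absvit_small_patch16_224
```

Passing a second argument, for example `train_cmd absvit_small_patch16_224 4`, changes only --nproc_per_node in the printed command.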

Finetuning on Vision-Language Dataset

Please see vision_language for instructions.

Links

This codebase is built upon the official code of "Visual Attention Emerges from Recurrent Sparse Reconstruction" and "Towards Robust Vision Transformer".

Citation

If you find this code helpful, please consider citing our work:


@inproceedings{shi2023top,
  title={Top-Down Visual Attention from Analysis by Synthesis},
  author={Shi, Baifeng and Darrell, Trevor and Wang, Xin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2102--2112},
  year={2023}
}