A-CLIP

Introduction

This is the official implementation of Attentive Mask CLIP (A-CLIP, ICCV 2023). A-CLIP improves the efficiency of CLIP training by introducing an image augmentation approach called image token removal.

This work improves the training efficiency of CLIP with an image token removal strategy, which has proven effective in tasks such as masked image modeling. However, random masking in CLIP can discard text-relevant content, and the resulting semantic mismatch between image and text hurts performance. To address this, we propose an attentive masking strategy that removes tokens while retaining the regions most relevant to the text.

<img src="./docs/masking_Page_1.png" alt="Visualization" width="50%" height="50%">

For example, under random masking it would be difficult to identify the Ferrari in the top-left image and align it correctly with the text. With our attentive selection, only irrelevant regions are removed, preserving as much of the semantics as possible and avoiding ambiguity.
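
The sketch below illustrates the core idea of attentive token selection; it is a simplified illustration, not this repository's exact implementation. It assumes the relevance score of each image patch is the attention weight that the [CLS] token assigns to it in a ViT-style vision encoder (the paper obtains such scores from an EMA-updated copy of the vision encoder), and that only the top-scoring half of the tokens is kept.

```python
import torch

def attentive_mask(cls_to_patch_attn: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Select the image tokens most attended by the [CLS] token.

    cls_to_patch_attn: (B, N) attention weights of [CLS] over the N patch tokens,
                       e.g. averaged over heads (and optionally layers) of a ViT.
    Returns the indices of the kept tokens, shape (B, int(N * keep_ratio)).
    """
    num_keep = int(cls_to_patch_attn.shape[1] * keep_ratio)
    # Higher attention = more relevant to the global ([CLS]) representation,
    # which CLIP aligns with the text; keep the top-scoring patches.
    keep_idx = cls_to_patch_attn.topk(num_keep, dim=1).indices
    return keep_idx

# Toy usage: 2 images, 196 patches (14x14 grid for a 224px ViT-B/16 input)
scores = torch.rand(2, 196).softmax(dim=-1)
idx = attentive_mask(scores, keep_ratio=0.5)
print(idx.shape)  # torch.Size([2, 98])
```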

| Mask Strategy (View × Ratio) | IN 1K 0-shot | Flickr30K I2T | Flickr30K T2I | MS COCO I2T | MS COCO T2I |
|---|---|---|---|---|---|
| w/o mask, 1×100% | 37.6 | 51.4 | 32.6 | 27.9 | 17.6 |
| +random mask, 1×50% | 35.0 | 48.8 | 32.5 | 28.9 | 16.6 |
| +random mask, 2×50% | 38.0 | 54.6 | 34.4 | 31.1 | 18.7 |
| +attentive mask, 1×50% | 39.5 | 57.6 | 36.6 | 34.2 | 19.8 |
| **+attentive mask, 2×50%** | **41.3** | **59.3** | **38.4** | **35.1** | **21.3** |

Experiments show that attentive masking avoids the semantic mismatch problems caused by random masking and brings the following benefits:

  1. Training efficiency: removing redundant tokens shortens the input sequence and speeds up CLIP training (see the sketch after this list).
  2. Feature stability: retaining the semantically relevant regions stabilizes the learned features and alleviates the ambiguity caused by semantic mismatch.

With this approach, A-CLIP significantly improves both the training efficiency and the performance of CLIP.
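
A minimal sketch of the token-dropping step behind benefit 1, assuming a standard ViT pipeline with a [CLS] token; the function and tensor names are illustrative, not the repository's API.

```python
import torch

def drop_masked_tokens(patch_tokens: torch.Tensor, keep_idx: torch.Tensor,
                       cls_token: torch.Tensor) -> torch.Tensor:
    """Keep only the selected patch tokens (plus [CLS]) before the ViT blocks.

    patch_tokens: (B, N, D) embedded image patches
    keep_idx:     (B, K) indices of the tokens selected by the attentive mask
    cls_token:    (B, 1, D)
    Returns a (B, 1 + K, D) sequence; with K = N/2 the self-attention cost drops
    roughly to a quarter and the MLP cost to a half.
    """
    B, _, D = patch_tokens.shape
    gathered = torch.gather(
        patch_tokens, dim=1,
        index=keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return torch.cat([cls_token, gathered], dim=1)

# Toy usage: ViT-B/16 sized tensors, keeping 98 of 196 patches
x = torch.randn(2, 196, 768)
cls = torch.randn(2, 1, 768)
idx = torch.randint(0, 196, (2, 98))
print(drop_masked_tokens(x, idx, cls).shape)  # torch.Size([2, 99, 768])
```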

We compare our Attentive Mask CLIP (A-CLIP) with CLIP, SLIP, and MaskCLIP. A-CLIP outperforms CLIP by +6.3%, +11.3/+9.5, and +10.1/+5.6 on ImageNet-1K zero-shot classification, Flickr30K I2T/T2I retrieval, and MS COCO I2T/T2I retrieval, respectively. An efficient variant, termed A-CLIP-eff, outperforms CLIP by +5.3%, +11.3/+8.0, and +9.5/+4.9 on the same benchmarks while reducing training time to 0.86×.

| Methods | Training Time | GPU Memory | IN 1K 0-shot | Flickr30K I2T/T2I | MS COCO I2T/T2I |
|---|---|---|---|---|---|
| CLIP | 1.00× | 14G | 37.6 | 51.4/32.6 | 27.9/17.6 |
| SLIP | 2.67× | 30G | 42.8 | 57.2/41.2 | 33.6/21.9 |
| MaskCLIP | 1.56× | 16G | 42.7 | 60.0/38.8 | 34.1/21.2 |
| A-CLIP | 1.16× | 14G | 43.9 | 62.7/42.1 | 38.0/23.2 |
| A-CLIP-eff | 0.86× | 13G | 42.9 | 62.7/40.6 | 37.4/22.5 |

Note: training wall-clock time and GPU memory footprint are measured on the same device, and training cost is reported relative to the original CLIP.

Zero-shot evaluation on a variety of classification benchmarks. Epochs denotes the number of training epochs. A-CLIP significantly outperforms the other methods at every epoch setting, both in average accuracy and in the number of winning tracks across the 25 downstream tasks.

| Epochs | Methods | Food-101 | CIFAR-10 | CIFAR-100 | CUB | SUN397 | Cars | Aircraft | DTD | Pets | Caltech-101 | Flowers | MNIST | FER-2013 | STL-10 | EuroSAT | RESISC45 | GTSRB | KITTI | Country211 | PCAM | UCF101 | Kinetics700 | CLEVR | HatefulMemes | SST2 | ImageNet | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | CLIP | 50.6 | 66.0 | 34.5 | 38.8 | 51.1 | 4.0 | 5.4 | 21.2 | 28.5 | 60.9 | 53.3 | 8.4 | 17.3 | 90.5 | 30.2 | 21.5 | 6.1 | 35.1 | 10.5 | 53.5 | 28.5 | 22.1 | 10.8 | 52.4 | 50.7 | 37.6 | 34.2 |
| 25 | SLIP | 59.5 | 78.6 | 45.2 | 38.7 | 53.4 | 5.4 | 5.7 | 26.1 | 31.1 | 71.0 | 56.6 | 9.8 | 19.6 | 94.4 | 20.3 | 28.9 | 14.5 | 34.0 | 11.6 | 55.4 | 37.7 | 26.9 | 17.5 | 52.8 | 51.1 | 42.8 | 38.0 |
| 25 | MaskCLIP | 60.6 | 70.1 | 41.6 | 43.3 | 54.0 | 4.9 | 8.2 | 25.5 | 36.8 | 68.9 | 53.6 | 11.2 | 22.4 | 93.9 | 35.1 | 24.8 | 10.1 | 30.5 | 12.5 | 51.2 | 37.0 | 28.1 | 12.9 | 52.8 | 50.0 | 42.7 | 37.8 |
| 25 | A-CLIP | 58.3 | 82.8 | 51.0 | 43.0 | 57.0 | 5.4 | 7.6 | 26.0 | 32.0 | 71.6 | 57.7 | 9.8 | 29.7 | 95.4 | 29.3 | 30.3 | 13.1 | 35.2 | 13.5 | 51.6 | 38.3 | 29.6 | 14.1 | 52.8 | 49.9 | 43.9 | 39.6 |
| 50 | CLIP | 55.2 | 77.0 | 43.8 | 38.9 | 49.0 | 4.7 | 6.3 | 23.5 | 27.2 | 63.5 | 56.2 | 12.5 | 30.2 | 92.1 | 21.0 | 31.9 | 7.4 | 33.6 | 10.9 | 50.8 | 35.5 | 24.8 | 14.0 | 49.9 | 50.1 | 39.4 | 36.5 |
| 50 | SLIP | 61.9 | 76.8 | 48.9 | 39.2 | 54.8 | 7.3 | 9.0 | 29.8 | 31.9 | 75.0 | 57.7 | 9.8 | 24.9 | 95.6 | 37.8 | 32.5 | 9.0 | 35.1 | 12.7 | 54.4 | 41.1 | 30.3 | 13.8 | 49.5 | 49.9 | 44.1 | 39.7 |
| 50 | A-CLIP | 62.2 | 81.5 | 53.7 | 48.2 | 58.7 | 8.3 | 10.2 | 27.7 | 40.5 | 73.3 | 61.0 | 11.3 | 32.9 | 95.5 | 39.7 | 37.5 | 9.4 | 23.3 | 14.4 | 63.7 | 42.5 | 31.6 | 19.6 | 50.8 | 52.3 | 46.3 | 42.2 |
| 100 | CLIP | 60.4 | 79.4 | 44.6 | 43.3 | 53.0 | 8.5 | 8.2 | 26.2 | 34.7 | 68.9 | 59.2 | 11.4 | 20.4 | 93.2 | 23.3 | 27.3 | 10.3 | 23.1 | 12.0 | 54.0 | 36.7 | 27.7 | 13.0 | 50.9 | 50.1 | 42.7 | 37.8 |
| 100 | SLIP | 63.0 | 83.1 | 50.4 | 43.0 | 52.0 | 8.3 | 8.3 | 26.2 | 34.0 | 74.6 | 61.1 | 16.1 | 32.4 | 95.1 | 22.6 | 28.5 | 10.5 | 34.8 | 11.5 | 52.1 | 37.3 | 28.3 | 13.7 | 55.2 | 49.9 | 45.0 | 39.9 |
| 100 | A-CLIP | 66.7 | 86.6 | 58.6 | 51.4 | 58.6 | 10.5 | 11.9 | 33.1 | 48.5 | 74.9 | 64.3 | 7.8 | 31.2 | 96.7 | 35.6 | 35.8 | 12.9 | 30.5 | 15.7 | 57.1 | 44.1 | 33.1 | 22.9 | 52.7 | 50.7 | 48.1 | 43.8 |
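
For reference, the sketch below outlines the standard CLIP-style zero-shot classification protocol used to produce numbers like those above: each class name is wrapped in a text prompt, encoded once by the text tower, and every image is assigned the class with the highest cosine similarity. The `encode_text`, `tokenizer`, and prompt template here are placeholders, not this repository's exact evaluation code; a single prompt template is used for brevity, whereas standard evaluation averages several templates per class.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_features, class_names, encode_text, tokenizer):
    """Standard CLIP zero-shot classification (simplified, single prompt).

    image_features: (B, D) image embeddings from the vision encoder
    encode_text / tokenizer: placeholders for the model's text tower
    """
    prompts = [f"a photo of a {name}." for name in class_names]
    text_features = encode_text(tokenizer(prompts))   # (C, D), one row per class
    # Cosine similarity = dot product of L2-normalized embeddings
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t()       # (B, C)
    return logits.argmax(dim=-1)                      # predicted class ids
```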

The table below shows the results of longer training schedules and a larger model size.

| Methods | IN 1K 0-shot | Flickr30K I2T | Flickr30K T2I | MS COCO I2T | MS COCO T2I |
|---|---|---|---|---|---|
| CLIP (25ep) | 37.6 | 51.4 | 32.6 | 27.9 | 17.6 |
| SLIP (25ep) | 42.8 | 57.2 | 41.2 | 33.6 | 21.9 |
| A-CLIP (25ep) | 43.9 | 62.7 | 42.1 | 38.0 | 23.2 |
| CLIP (50ep) | 39.4 | 53.9 | 35.8 | 30.2 | 19.2 |
| SLIP (50ep) | 44.1 | 60.6 | 41.1 | 33.2 | 22.3 |
| A-CLIP (50ep) | 46.3 | 66.7 | 43.2 | 39.8 | 24.4 |
| CLIP (100ep) | 42.7 | 61.0 | 37.9 | 34.4 | 20.9 |
| SLIP (100ep) | 45.0 | 59.3 | 41.4 | 34.6 | 22.7 |
| A-CLIP (100ep) | 48.0 | 66.3 | 45.7 | 40.7 | 25.1 |
| CLIP (ViT-L) | 40.4 | 51.4 | 35.2 | 28.9 | 18.5 |
| SLIP (ViT-L) | 46.2 | 60.6 | 43.7 | 35.3 | 23.5 |
| A-CLIP (ViT-L) | 48.9 | 64.1 | 48.2 | 39.1 | 26.9 |

Setup

Install PyTorch and timm. The code has been tested with CUDA 11.6, PyTorch 1.13.0 and timm 0.5.4.

YFCC15M Setup

For data preparation, refer to SLIP.

Pre-training

A-CLIP ViT-Base on 8 nodes with 8 GPUs each (per-GPU batch size 64, global batch size 4096):

```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 main.py \
  --root /path/to/yfcc100m --dataset yfcc15m --metadata /path/to/yfcc15m.pkl \
  --model ACLIP_VITB16 --batch-size 64 \
  --lr 5e-4 --wd 0.5
```

Visualization

The examples below show how the attentive mask preserves the content described by the text while filtering out redundant background.

<img src="./docs/vis.png" alt="Visualization" width="max-width: 100%; height: auto;">

Citation

If the code and paper help your research, please cite:

```bibtex
@InProceedings{Yang_2023_ICCV,
    author    = {Yang, Yifan and Huang, Weiquan and Wei, Yixuan and Peng, Houwen and Jiang, Xinyang and Jiang, Huiqiang and Wei, Fangyun and Wang, Yin and Hu, Han and Qiu, Lili and Yang, Yuqing},
    title     = {Attentive Mask CLIP},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2771-2781}
}
```