Set-level Guidance Attack

The official repository for Set-level Guidance Attack (SGA).
ICCV 2023 Oral Paper: Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (https://arxiv.org/abs/2307.14061)

Please feel free to contact wangzq_2021@outlook.com if you have any questions.

<img src='./imgs/sga.png' width=90%>

Brief Introduction

Vision-language pre-training (VLP) models have been shown to be vulnerable to adversarial attacks. However, existing work mainly focuses on the adversarial robustness of VLP models in the white-box setting. In this work, we investigate the robustness of VLP models in the black-box setting from the perspective of adversarial transferability. We propose Set-level Guidance Attack (SGA), which generates highly transferable adversarial examples against VLP models.
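At a high level, SGA enlarges a single image-text pair into aligned sets (multiple scaled views of the image together with several paired captions) and lets the whole set guide each perturbation step. The sketch below illustrates one such step; it is a minimal illustration under stated assumptions, not the repository's implementation: the CLIP-style `encode_image`/`encode_text` interface, the inclusion of the original resolution in the scale set, and the cosine-similarity loss are all assumed here.

```python
# Minimal sketch of the set-level guidance idea (illustrative only; the actual
# SGA implementation lives in this repository). The CLIP-style
# `encode_image` / `encode_text` interface and the similarity loss are assumptions.
import torch
import torch.nn.functional as F

def sga_image_step(model, image, captions, delta,
                   scales=(0.5, 0.75, 1.25, 1.5),
                   alpha=2 / 255, eps=8 / 255):
    """One PGD-style step guided by a set of scaled views and paired captions."""
    delta = delta.detach().requires_grad_(True)
    # Encode the whole caption set once: SGA pairs each image with several captions.
    text_feats = F.normalize(model.encode_text(captions), dim=-1)
    loss = 0.0
    # Attack the original resolution together with every scaled view of the image.
    for s in (1.0, *scales):
        view = F.interpolate(image + delta, scale_factor=s,
                             mode='bilinear', align_corners=False)
        img_feat = F.normalize(model.encode_image(view), dim=-1)
        # Push the image embedding away from all paired caption embeddings.
        loss = loss - (img_feat @ text_feats.t()).mean()
    loss.backward()
    # Sign-gradient ascent on the set-level loss, projected to the L_inf ball.
    with torch.no_grad():
        delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
    return delta.detach()
```

The intuition is that guidance averaged over a set of aligned views and captions overfits the source model less than guidance from a single pair, which is what boosts cross-model transferability.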

Quick Start

1. Install dependencies

See requirements.txt.
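The dependencies can be installed with pip in the usual way:

pip install -r requirements.txt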

2. Prepare datasets and models

Download the Flickr30k and MSCOCO datasets (the annotations are provided in ./data_annotation/). Then set the root path of the dataset via the image_root field in ./configs/Retrieval_flickr.yaml, as shown in the excerpt below.
The checkpoints of the fine-tuned VLP models are available from ALBEF, TCL, and CLIP.
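For reference, the relevant entry of ./configs/Retrieval_flickr.yaml should look roughly like this (the path is a placeholder for your local image directory):

image_root: '/path/to/flickr30k-images/'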

3. Attack evaluation

From ALBEF to TCL on the Flickr30k dataset:

python eval_albef2tcl_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF  --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model TCL --target_ckpt ./checkpoint/tcl_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5

From ALBEF to CLIP<sub>ViT</sub> on the Flickr30k dataset:

python eval_albef2clip-vit_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF  --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model ViT-B/16 --original_rank_index ./std_eval_idx/flickr30k/ \
--scales 0.5,0.75,1.25,1.5

From CLIP<sub>ViT</sub> to ALBEF on the Flickr30k dataset:

python eval_clip-vit2albef_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16  --target_model ALBEF \
--target_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5

From CLIP<sub>ViT</sub> to CLIP<sub>CNN</sub> on the Flickr30k dataset:

python eval_clip-vit2clip-cnn_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16  --target_model RN101 \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5
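A note on the flags shared by all four commands: --scales lists the resize factors that build the set of scaled image views used for set-level guidance (compare the sketch in the Brief Introduction). Judging by its name and argument, --original_rank_index points to precomputed retrieval ranks of the clean data under ./std_eval_idx/, which serve as the reference when measuring attack success.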

Transferability Evaluation

Existing adversarial attacks on VLP models struggle to generate highly transferable adversarial examples.
(Note: Sep-Attack denotes the simple combination of two unimodal adversarial attacks: PGD on the image modality and BERT-Attack on the text modality.)

<table style="border-collapse: collapse; width: 100%;"> <thead> <tr><th rowspan="2">Attack</th><th colspan="2">ALBEF*</th><th colspan="2">TCL</th><th colspan="2">CLIP<sub>ViT</sub></th><th colspan="2">CLIP<sub>CNN</sub></th></tr> <tr><th>TR R@1*</th><th>IR R@1*</th><th>TR R@1</th><th>IR R@1</th><th>TR R@1</th><th>IR R@1</th><th>TR R@1</th><th>IR R@1</th></tr> </thead> <tbody> <tr><td>Sep-Attack</td><td>65.69</td><td>73.95</td><td>17.60</td><td>32.95</td><td>31.17</td><td><strong>45.23</strong></td><td>32.82</td><td>45.49</td></tr> <tr><td>Sep-Attack + MI</td><td>58.81</td><td>65.25</td><td>16.02</td><td>28.19</td><td>23.07</td><td>36.98</td><td>26.56</td><td>39.31</td></tr> <tr><td>Sep-Attack + DIM</td><td>56.41</td><td>64.24</td><td>16.75</td><td>29.55</td><td>24.17</td><td>37.60</td><td>25.54</td><td>38.77</td></tr> <tr><td>Sep-Attack + PNA_PO</td><td>40.56</td><td>53.95</td><td>18.44</td><td>30.98</td><td>22.33</td><td>37.02</td><td>26.95</td><td>38.63</td></tr> <tr><td>Co-Attack</td><td>77.16</td><td>83.86</td><td>15.21</td><td>29.49</td><td>23.60</td><td>36.48</td><td>25.12</td><td>38.89</td></tr> <tr><td>Co-Attack + MI</td><td>64.86</td><td>75.26</td><td>25.40</td><td>38.69</td><td>24.91</td><td>37.11</td><td>26.31</td><td>38.97</td></tr> <tr><td>Co-Attack + DIM</td><td>47.03</td><td>62.28</td><td>22.23</td><td>35.45</td><td>25.64</td><td>38.50</td><td>26.95</td><td>40.58</td></tr> <tr style="background-color: #EEEEEE;"><td>SGA</td><td><strong>97.24</strong></td><td><strong>97.28</strong></td><td><strong>45.42</strong></td><td><strong>55.25</strong></td><td><strong>33.38</strong></td><td>44.16</td><td><strong>34.93</strong></td><td><strong>46.57</strong></td></tr> </tbody> </table>

The performance of SGA across four VLP models (ALBEF, TCL, CLIP<sub>ViT</sub>, and CLIP<sub>CNN</sub>) on the Flickr30k dataset. Entries marked with * are white-box results, where the source and target models are the same.

<table style="border-collapse: collapse; width: 100%;">
<thead>
<tr><th rowspan="2">Source</th><th rowspan="2">Attack</th><th colspan="2">ALBEF</th><th colspan="2">TCL</th><th colspan="2">CLIP<sub>ViT</sub></th><th colspan="2">CLIP<sub>CNN</sub></th></tr>
<tr><th>TR R@1</th><th>IR R@1</th><th>TR R@1</th><th>IR R@1</th><th>TR R@1</th><th>IR R@1</th><th>TR R@1</th><th>IR R@1</th></tr>
</thead>
<tbody> <!-- ALBEF -->
<tr><td rowspan="5">ALBEF</td><td>PGD</td><td>52.45*</td><td>58.65*</td><td>3.06</td><td>6.79</td><td>8.96</td><td>13.21</td><td>10.34</td><td>14.65</td></tr>
<tr><td>BERT-Attack</td><td>11.57*</td><td>27.46*</td><td>12.64</td><td>28.07</td><td>29.33</td><td>43.17</td><td>32.69</td><td>46.11</td></tr>
<tr><td>Sep-Attack</td><td>65.69*</td><td>73.95*</td><td>17.60</td><td>32.95</td><td>31.17</td><td><strong>45.23</strong></td><td>32.82</td><td>45.49</td></tr>
<tr><td>Co-Attack</td><td>77.16*</td><td>83.86*</td><td>15.21</td><td>29.49</td><td>23.60</td><td>36.48</td><td>25.12</td><td>38.89</td></tr>
<tr style="background-color: #EEEEEE;"><td>SGA</td><td><strong>97.24&plusmn;0.22*</strong></td><td><strong>97.28&plusmn;0.15*</strong></td><td><strong>45.42&plusmn;0.60</strong></td><td><strong>55.25&plusmn;0.06</strong></td><td><strong>33.38&plusmn;0.35</strong></td><td>44.16&plusmn;0.25</td><td><strong>34.93&plusmn;0.99</strong></td><td><strong>46.57&plusmn;0.13</strong></td></tr>
</tbody>
<tbody> <!-- TCL -->
<tr><td rowspan="5">TCL</td><td>PGD</td><td>6.15</td><td>10.78</td><td>77.87*</td><td>79.48*</td><td>7.48</td><td>13.72</td><td>10.34</td><td>15.33</td></tr>
<tr><td>BERT-Attack</td><td>11.89</td><td>26.82</td><td>14.54*</td><td>29.17*</td><td>29.69</td><td>44.49</td><td>33.46</td><td>46.07</td></tr>
<tr><td>Sep-Attack</td><td>20.13</td><td>36.48</td><td>84.72*</td><td>86.07*</td><td>31.29</td><td>44.65</td><td>33.33</td><td>45.80</td></tr>
<tr><td>Co-Attack</td><td>23.15</td><td>40.04</td><td>77.94*</td><td>85.59*</td><td>27.85</td><td>41.19</td><td>30.74</td><td>44.11</td></tr>
<tr style="background-color: #EEEEEE;"><td>SGA</td><td><strong>48.91&plusmn;0.74</strong></td><td><strong>60.34&plusmn;0.10</strong></td><td><strong>98.37&plusmn;0.08*</strong></td><td><strong>98.81&plusmn;0.07*</strong></td><td><strong>33.87&plusmn;0.18</strong></td><td><strong>44.88&plusmn;0.54</strong></td><td><strong>37.74&plusmn;0.27</strong></td><td><strong>48.30&plusmn;0.34</strong></td></tr>
</tbody>
<tbody> <!-- CLIP ViT -->
<tr><td rowspan="5">CLIP<sub>ViT</sub></td><td>PGD</td><td>2.50</td><td>4.93</td><td>4.85</td><td>8.17</td><td>70.92*</td><td>78.61*</td><td>5.36</td><td>8.44</td></tr>
<tr><td>BERT-Attack</td><td>9.59</td><td>22.64</td><td>11.80</td><td>25.07</td><td>28.34*</td><td>39.08*</td><td>30.40</td><td>37.43</td></tr>
<tr><td>Sep-Attack</td><td>9.59</td><td>23.25</td><td>11.38</td><td>25.60</td><td>79.75*</td><td>86.79*</td><td>30.78</td><td>39.76</td></tr>
<tr><td>Co-Attack</td><td>10.57</td><td>24.33</td><td>11.94</td><td>26.69</td><td>93.25*</td><td>95.86*</td><td>32.52</td><td>41.82</td></tr>
<tr style="background-color: #EEEEEE;"><td>SGA</td><td><strong>13.40&plusmn;0.07</strong></td><td><strong>27.22&plusmn;0.06</strong></td><td><strong>16.23&plusmn;0.45</strong></td><td><strong>30.76&plusmn;0.07</strong></td><td><strong>99.08&plusmn;0.08*</strong></td><td><strong>98.94&plusmn;0.00*</strong></td><td><strong>38.76&plusmn;0.27</strong></td><td><strong>47.79&plusmn;0.58</strong></td></tr>
</tbody>
<tbody> <!-- CLIP CNN -->
<tr><td rowspan="5">CLIP<sub>CNN</sub></td><td>PGD</td><td>2.09</td><td>4.82</td><td>4.00</td><td>7.81</td><td>1.10</td><td>6.60</td><td>86.46*</td><td>92.25*</td></tr>
<tr><td>BERT-Attack</td><td>8.86</td><td>23.27</td><td>12.33</td><td>25.48</td><td>27.12</td><td>37.44</td><td>30.40*</td><td>40.10*</td></tr>
<tr><td>Sep-Attack</td><td>8.55</td><td>23.41</td><td>12.64</td><td>26.12</td><td>28.34</td><td>39.43</td><td>91.44*</td><td>95.44*</td></tr>
<tr><td>Co-Attack</td><td>8.79</td><td>23.74</td><td>13.10</td><td>26.07</td><td>28.79</td><td>40.03</td><td>94.76*</td><td>96.89*</td></tr>
<tr style="background-color: #EEEEEE;"><td>SGA</td><td><strong>11.42&plusmn;0.07</strong></td><td><strong>24.80&plusmn;0.28</strong></td><td><strong>14.91&plusmn;0.08</strong></td><td><strong>28.82&plusmn;0.11</strong></td><td><strong>31.24&plusmn;0.42</strong></td><td><strong>42.12&plusmn;0.11</strong></td><td><strong>99.24&plusmn;0.18*</strong></td><td><strong>99.49&plusmn;0.05*</strong></td></tr>
</tbody>
</table>

Visualization

<p align="left"> <img src="./imgs/visualization.png" width=100%> </p>

Citation

If this work is helpful for your research, please consider citing:

@misc{lu2023setlevel,
    title={Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models},
    author={Dong Lu and Zhiqiang Wang and Teng Wang and Weili Guan and Hongchang Gao and Feng Zheng},
    year={2023},
    eprint={2307.14061},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}