"That is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Word-Level Adversarial Attacks

Supplementary material

If you use or draw inspiration from this repository, please reference our ACL 2022 paper:

@inproceedings{mosca-etal-2022-suspicious,
    title = "{``}That Is a Suspicious Reaction!{''}: Interpreting Logits Variation to Detect {NLP} Adversarial Attacks",
    author = "Mosca, Edoardo  and
      Agarwal, Shreyash  and
      Rando Ram{\'\i}rez, Javier  and
      Groh, Georg",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.538",
    pages = "7806--7816",
    abstract = "Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.",
}

Introduction

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different models, datasets, and word-level attacks.

Code usage guide

In this section, we explain how to use the code to reproduce or extend the results. Important considerations:

  - This version of the code is built to be self-contained and easy to follow. For efficiency, you may want to split it up, e.g. pre-compute and store the logits so that the WDR features are not regenerated in each execution.
  - You may also want to increase the number of detectors trained, as we did in order to report statistical significance.
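As a reference for the pre-computation idea, here is a minimal sketch of how the logit-variation (WDR) features could be computed once and cached. The function names, the `[UNK]` replacement token, and the cache file are illustrative assumptions, not the exact implementation used in the notebooks.

```python
import pickle
import numpy as np

def wdr_features(text, predict_logits, unk_token="[UNK]"):
    """Sketch of a word-level differential reaction (WDR) computation:
    replace each word with an out-of-vocabulary token and record the
    margin of the originally predicted class over the runner-up class."""
    words = text.split()
    original = np.asarray(predict_logits(text))   # shape: (num_classes,)
    pred = int(np.argmax(original))

    reactions = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + [unk_token] + words[i + 1:])
        logits = np.asarray(predict_logits(perturbed))
        margin = logits[pred] - np.max(np.delete(logits, pred))
        reactions.append(margin)
    return reactions

# Pre-compute once and cache, so the WDR features are not regenerated in
# every execution. `texts` and `predict_logits` (the target classifier's
# logit function) are assumed to be defined elsewhere.
# features = [wdr_features(t, predict_logits) for t in texts]
# with open("wdr_features.pkl", "wb") as f:
#     pickle.dump(features, f)
```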

Code structure

The code is organized into the following top-level directories, each of which is referenced in the pipeline below:

  - Generating Adversarial Samples/ -> notebooks for crafting adversarial samples, plus the pre-computed samples in Data/.
  - Classifier/ -> notebooks for training (Training Classifier/) and testing (Testing Classifier/) the adversarial detector.
  - FGWS/ -> code to reproduce the FGWS baseline.

Code usage and execution pipeline

These are the steps required to reproduce the project results:

  1. Generating Adversarial Samples/Command Line Adversarial Attack.ipynb -> Generate adversarial samples for the desired setup and store them. This step can be skipped, since the resulting samples are already provided in the repository (see Generating Adversarial Samples/Data). A sketch of this step is shown after this list.
  2. Classifier/Training Classifier/Training_Classifier.ipynb -> Compute the logit differences for adversarial and original samples of the desired dataset, build the input dataframe for the detector, and then train and store the adversarial classifier.
  3. Classifier/Testing Classifier/Testing Classifier.ipynb -> Compute the logit differences for adversarial and original samples of the desired dataset, build the corresponding input dataframe, and evaluate the trained detector on it (see the training/testing sketch after this list).
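For step 1, the sketch below shows one way to generate and store word-level adversarial samples with the TextAttack library, whose attack recipes are the ones evaluated in the paper. The victim checkpoint, the attack recipe, and the output file are illustrative choices rather than the exact configuration of the notebook.

```python
import transformers
import textattack

# Load a pre-trained victim classifier (illustrative checkpoint).
model_name = "textattack/bert-base-uncased-imdb"
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

# Build a word-level attack recipe (e.g. PWWS) against the wrapped model.
attack = textattack.attack_recipes.PWWSRen2019.build(model_wrapper)
dataset = textattack.datasets.HuggingFaceDataset("imdb", split="test")

# Attack a subset of the dataset and log original/perturbed pairs to CSV.
attack_args = textattack.AttackArgs(num_examples=100, log_to_csv="adversarial_samples.csv")
attacker = textattack.Attacker(attack, dataset, attack_args)
attacker.attack_dataset()
```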
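For steps 2 and 3, the following sketch illustrates the general shape of training and testing an adversarial detector on the logit-difference features. The fixed feature length, the random-forest model, and the variable names are assumptions made for illustration; the notebooks define the exact input dataframe and detector.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def pad_features(feature_lists, length):
    """Pad/truncate the per-word WDR lists to a fixed length so that
    samples of different lengths share one feature matrix."""
    out = np.zeros((len(feature_lists), length))
    for i, feats in enumerate(feature_lists):
        feats = np.asarray(feats)[:length]
        out[i, :len(feats)] = feats
    return out

# `original_feats` and `adversarial_feats` are assumed to be lists of WDR
# feature lists, e.g. loaded from the cached features computed earlier.
# X = pad_features(original_feats + adversarial_feats, length=100)
# y = np.array([0] * len(original_feats) + [1] * len(adversarial_feats))
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# detector = RandomForestClassifier(n_estimators=100, random_state=0)
# detector.fit(X_train, y_train)                                   # step 2: train
# print(classification_report(y_test, detector.predict(X_test)))   # step 3: test
```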

Optional: You can use the code within FGWS/ to reproduce the baseline results against which our method was benchmarked.