white2black
INTRODUCTION
The official code to reproduce the results in the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
The code is divided into sub-packages:
1. ./Agents - learned adversarial attack generators
2. ./Attacks - optimization attacks such as HotFlip (see the sketch after this list)
3. ./Toxicity Classifier - a classifier of toxic/non-toxic sentences
4. ./Data - data handling
5. ./Resources - resources for the other sub-packages
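Item 2 above refers to HotFlip, a gradient-guided character-flip attack. Below is a minimal, hypothetical sketch of a single HotFlip step, not the code in ./Attacks: it assumes a character-level classifier `model` that takes a one-hot float tensor of shape (seq_len, vocab_size) and a 0-dim long `label`.

```python
import torch
import torch.nn.functional as F

def hotflip_step(model, one_hot_text, label):
    """Sketch: pick the single character flip that most increases the loss.

    Assumes `one_hot_text` is a (seq_len, vocab_size) one-hot float tensor
    and `label` is a 0-dim long tensor; both names are illustrative.
    """
    x = one_hot_text.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    grad = x.grad                                     # (seq_len, vocab_size)
    # First-order estimate of the loss change for flipping position i to
    # character c: grad[i, c] - grad[i, current_char(i)].
    current = (grad * x.detach()).sum(dim=1, keepdim=True)
    gain = grad - current
    gain[one_hot_text.bool()] = float("-inf")         # forbid "flipping" to self
    pos, new_char = divmod(int(gain.argmax()), gain.size(1))
    flipped = one_hot_text.clone()
    flipped[pos] = 0
    flipped[pos, new_char] = 1
    return flipped
```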
ALGORITHM
As shown in the figure below, we train a classifier to predict whether a sentence is toxic or non-toxic.
We attack this model using a white-box algorithm called HotFlip and distill the knowledge into a second model, DistFlip, which is able to generate attacks in a black-box manner.
These attacks generalize well to the Google Perspective algorithm (tested January 2019).
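The distillation step can be viewed as imitation learning: the white-box attack acts as a teacher whose flip decisions become training labels for DistFlip. Here is a minimal sketch of that training loop, not the repo's actual code; `hotflip_attack` (yielding each intermediate sentence state together with the flip the teacher chose) and `distflip` (predicting position and character logits from the raw text alone) are hypothetical helpers.

```python
import torch
import torch.nn.functional as F

def distill(distflip, hotflip_attack, classifier, sentences, epochs=10):
    """Sketch: train DistFlip to imitate the white-box HotFlip attack."""
    opt = torch.optim.Adam(distflip.parameters())
    for _ in range(epochs):
        for sent in sentences:
            # The white-box attack yields, for each intermediate state of the
            # sentence, the (position, character) flip it chose next.
            # `pos` and `char` are assumed to be shape-(1,) long tensors.
            for state, pos, char in hotflip_attack(classifier, sent):
                # DistFlip sees only the text, never the classifier's
                # gradients, so at test time it attacks as a black box.
                pos_logits, char_logits = distflip(state)
                loss = (F.cross_entropy(pos_logits, pos)
                        + F.cross_entropy(char_logits, char))
                opt.zero_grad()
                loss.backward()
                opt.step()
    return distflip
```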
DATA
We used the data from this Kaggle challenge by Jigsaw.
For data flipped using HotFlip, you can download
the data from Google Drive
and unzip it into: ./toxic_fool/resources/data
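A minimal sketch of the unzip step, assuming the downloaded archive is named `data.zip` (the actual file name on Google Drive may differ):

```python
import zipfile

# Extract the downloaded archive into the path expected by the code.
with zipfile.ZipFile("data.zip") as archive:
    archive.extractall("./toxic_fool/resources/data")
```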
RESULTS
The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green):
Some example sentences: