# ToViLaG

Official code of the EMNLP 2023 paper *ToViLaG: Your Visual-Language Generative Model is Also An Evildoer*.
## Metrics

### WInToRe Metric
Run the following command to compute the WInToRe metric.
```bash
python metrics/toxicity/wintore.py --input wintore_input.txt --output wintore_output.txt --start 0 --end 1 --M 20
```
Arguments include:

- `--input`: The file containing the input toxicity list. See `wintore_input.txt` for an example (and the sketch after this list for producing it).
- `--output`: The file to write the results to. See `wintore_output.txt` for an example.
- `--start`: Start of the threshold range.
- `--end`: End of the threshold range.
- `--M`: The number of thresholds in the threshold set.
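Assuming the input file holds one toxicity score per line (an assumption; consult `wintore_input.txt` for the authoritative format), a minimal sketch for producing it from classifier scores:

```python
# Minimal sketch: write toxicity scores, one per line, to the WInToRe
# input file. The one-score-per-line layout is an assumption; check
# wintore_input.txt in the repo for the authoritative format.
scores = [0.92, 0.15, 0.47, 0.88]  # hypothetical classifier outputs in [0, 1]

with open("wintore_input.txt", "w") as f:
    for s in scores:
        f.write(f"{s:.6f}\n")
```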
### Quality Metrics
- Image-to-text metrics: BERTScore, ROUGE, and CLIPSIM.
- Text-to-image metrics: IS, FID, and CLIPSIM (CLIPSIM is sketched below).
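CLIPSIM is commonly computed as the cosine similarity between CLIP image and text embeddings. A sketch using the Hugging Face `transformers` CLIP classes; the `openai/clip-vit-base-patch32` checkpoint is an assumption and may differ from the variant used in the paper:

```python
# Sketch of a CLIPSIM-style score: cosine similarity between CLIP image
# and text embeddings. The checkpoint is an assumption, not necessarily
# the variant used in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
text = "a caption to score against the image"

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Normalize the projected embeddings, then take their dot product.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
clipsim = (img_emb * txt_emb).sum(dim=-1).item()
print(f"CLIPSIM: {clipsim:.4f}")
```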
### Toxicity Classifier
Text toxicity classifier: Perspective API. A simple direct implementation is available here.
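For reference, a minimal sketch of querying the Perspective API for a toxicity score with `google-api-python-client`, following the public quickstart; the API key is a placeholder:

```python
# Sketch: score a text's toxicity with the Perspective API.
# Requires `pip install google-api-python-client` and a valid API key.
from googleapiclient import discovery

API_KEY = "YOUR_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    "comment": {"text": "some generated caption to score"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"toxicity: {score:.4f}")
```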
Image toxicity classifiers: We fine-tune three ViT-Huge models, one for each of the three types of toxicity, on a portion of the toxic images.
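At inference time, such a classifier could be used as below. This is a sketch only: the checkpoint path is a placeholder, and the binary head with label 1 = "toxic" is an assumption:

```python
# Sketch: score an image with one of the fine-tuned ViT-Huge toxicity
# classifiers. The checkpoint path is a placeholder and the binary
# head with label 1 = "toxic" is an assumption.
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

ckpt = "path/to/finetuned-vit-huge"  # placeholder checkpoint
processor = ViTImageProcessor.from_pretrained(ckpt)
model = ViTForImageClassification.from_pretrained(ckpt)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

toxic_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(toxic) = {toxic_prob:.4f}")
```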
## ToViLaG Dataset

### Statistics
| Category | Number of Images | Number of Texts |
|---|---|---|
| Mono-toxic pairs `<toxic image, non-toxic text>` | 4,349 | 10,000 |
| Mono-toxic pairs `<toxic text, non-toxic image>` | 10,000 | 9,794 |
| Co-toxic pairs `<toxic text, toxic image>` | 5,142 | 9,869 |
| Provocative text prompts | – | 902 |
| Unpaired | 21,559 | 31,674 |
### Unpaired data

Unpaired toxic images:

- Pornographic images: Download the NSFW Image Classification dataset from Kaggle. We use the `porn` class in the test set for toxicity benchmarking, 8,595 images in total.
- Violent images: Request the UCLA Protest Image Dataset from here, provided in Won et al., *Protest Activity Detection and Perceived Violence Estimation from Social Media Images*, ACM Multimedia 2017. We use the combined `protest` class from the train and test sets for toxicity benchmarking, 11,659 images in total.
- Bloody images: Please contact me via email to obtain the images; 1,305 images in total are used for toxicity benchmarking.

Unpaired toxic text: We use a subset of 21,805 texts for toxicity benchmarking, which can be downloaded from here.
### Mono-toxic pairs

`<toxic image, non-toxic text>`

- Toxic images: Same as the unpaired toxic images.
- Non-toxic text: Generated by GIT for the toxic images; filtered by Perspective API, PPL, CLIPScore, and Jaccard similarity.

`<toxic text, non-toxic image>`

- Ready-made: Detected and collected from existing VL datasets:

  | Dataset | Number of toxic pairs |
  |---|---|
  | COCO | 570 |
  | Flickr30k | 233 |
  | CC12M | 4,286 |

- Augmented:
  - Non-toxic images: From a subset of COCO.
  - Toxic text: Rewritten by fBERT from the texts paired with the non-toxic images; filtered by Perspective API, PPL, CLIPScore, and Jaccard similarity.
### Co-toxic pairs

- Toxic images: Same as the unpaired toxic images.
- Toxic text: Generated by BLIP for the toxic images; filtered by Perspective API, PPL, CLIPScore, and Jaccard similarity (a sketch of such a filter follows below).
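As a rough illustration of the four filtering criteria used for both the mono-toxic and co-toxic text, here is a hypothetical filter skeleton; the thresholds and the scorer callables are placeholders, not the paper's actual values:

```python
# Hypothetical filter skeleton for generated text. The three scorer
# callables stand in for Perspective API toxicity, language-model
# perplexity (PPL), and CLIP image-text similarity; all thresholds
# are made up for illustration.
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def keep_text(text, source_text, image,
              toxicity_score, ppl_score, clip_score,
              want_toxic=False) -> bool:
    # For co-toxic pairs the text should be toxic; for the
    # <toxic image, non-toxic text> pairs it should not be.
    tox_ok = (toxicity_score(text) > 0.5) if want_toxic \
        else (toxicity_score(text) < 0.5)
    return (
        tox_ok
        and ppl_score(text) < 100.0           # fluency ceiling (hypothetical)
        and clip_score(image, text) > 0.25    # relevance floor (hypothetical)
        and jaccard(text, source_text) < 0.8  # reject near-duplicates
    )
```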
### Innocuous provocative text prompts
Constructed by a gradient-guided search method on Stable Diffusion.
Download the prompts from here.
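After downloading, the prompts can be fed to a text-to-image model directly. A sketch with the `diffusers` Stable Diffusion pipeline; the checkpoint and the prompt-file name are assumptions:

```python
# Sketch: generate images for the downloaded prompts with Stable
# Diffusion via diffusers. The checkpoint and the prompt-file name
# are assumptions; adjust to the actual release.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

with open("provocative_prompts.txt") as f:  # hypothetical file name
    prompts = [line.strip() for line in f if line.strip()]

os.makedirs("outputs", exist_ok=True)
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    image.save(f"outputs/{i:04d}.png")
```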
## Toxicity Analysis

### Toxicity Benchmarking

#### Image-to-text generation
We use 21,559 toxic images to evaluate the I2T models.
In our paper, all models use top-k and top-p sampling to generate outputs (a sampling sketch follows after the table). The toxicity evaluation results for each model are as follows:
Models | TP% ↑ | WInToRe% ↓ |
---|---|---|
OFA | 3.41 | 90.16 |
VinVL | 2.06 | 89.56 |
CLIP-ViL$_{RN50}$ | 0.74 | 88.99 |
GIT | 11.57 | 86.13 |
GRIT | 12.79 | 84.70 |
LLaVA | 29.25 | 80.89 |
BLIP | 32.51 | 75.66 |
BLIP2$_{OPT2.7B-COCO}$ | 37.61 | 66.55 |
BLIP2$_{OPT2.7B}$ | 40.41 | 64.76 |
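The top-k/top-p sampling referenced above maps directly onto the Hugging Face `generate` API. A sketch with BLIP; the checkpoint and the sampling hyperparameters are assumptions, not the settings used in the paper:

```python
# Sketch: caption an image with BLIP using top-k / top-p sampling.
# Checkpoint and sampling hyperparameters are assumptions, not the
# paper's exact settings.
import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

ckpt = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, do_sample=True, top_k=50, top_p=0.9,
                         max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```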
#### Text-to-image generation
We use 21,805 toxic prompts and 902 provocative prompts to evaluate the T2I models.
The toxicity evaluation results of each model are as follows:
| Models | TP% ↑ (toxic) | WInToRe% ↓ (toxic) | TP% ↑ (provocative) | WInToRe% ↓ (provocative) |
|---|---|---|---|---|
CogView2 | 8.10 | 81.37 | 44.68 | -8.59 |
DALLE-Mega | 10.19 | 80.96 | 33.15 | -7.29 |
OFA | 19.08 | 80.64 | 37.03 | -7.44 |
Stable Diffusion | 23.32 | 80.12 | 100 | -19.02 |
LAFITE | 21.48 | 79.33 | 27.38 | -6.51 |
CLIP-GEN | 22.93 | 79.97 | 7.32 | 1.18 |
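As a small worked example of the TP% column: reading TP% as the percentage of generations whose toxicity score exceeds a classification threshold (our interpretation, with 0.5 as an assumed threshold):

```python
# Sketch: TP% as the percentage of outputs whose toxicity score
# exceeds a threshold. Both the definition and the 0.5 threshold
# are our assumptions, not taken from the paper.
def toxicity_proportion(scores, threshold=0.5):
    return 100.0 * sum(s > threshold for s in scores) / len(scores)

print(toxicity_proportion([0.92, 0.15, 0.47, 0.88]))  # -> 50.0
```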
### Toxicity Injection

We fine-tune each model on the mono-toxic pairs and on the co-toxic pairs, respectively.

- Image-to-text generation models: GIT, GRIT, BLIP
- Text-to-image generation models: Stable Diffusion, LAFITE, CLIP-GEN
### SMIB Detoxification Method

We apply the SMIB method to three models in our paper: GIT, GRIT, and BLIP.
We use 5,000 non-toxic image-text pairs from COCO and 5,000 toxic pairs from our co-toxic pairs for training. We take the implementation of BLIP with SMIB as an example.
Run the following command to train the BLIP model with SMIB detoxification:

```bash
python method/BLIP/train_caption_detox.py --output_dir outputs/detox --device 1
```
Infer the detoxified text for toxic images:
```bash
python method/BLIP/inference.py --image_path /path/to/toxic_images/ --model_size large --device 1
```
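To run inference over several image folders in one go, a small driver sketch; the folder paths are placeholders, and the flags mirror the command above:

```python
# Sketch: drive the BLIP detox inference script over several folders.
# Folder paths are placeholders; the flags mirror the command above.
import subprocess

folders = ["/path/to/porn_images/", "/path/to/violent_images/"]  # placeholders
for folder in folders:
    subprocess.run(
        ["python", "method/BLIP/inference.py",
         "--image_path", folder,
         "--model_size", "large",
         "--device", "1"],
        check=True,
    )
```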
## Contact

If you have any problems with the implementation or any other questions, feel free to open an issue or email me (wangxinpeng@tongji.edu.cn).