WinoQueer

Paper

Our paper, "WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models," can be found on arXiv!

Benchmark

The WinoQueer benchmark dataset is at data/winoqueer_final.csv.

Columns are:

| Col. Name   | Meaning                        |
|-------------|--------------------------------|
| Gender_ID_X | bias target category           |
| Gender_ID_Y | counterfactual target category |
| sent_x      | biased/offensive sentence      |
| sent_y      | counterfactual sentence        |
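
For example, a sentence pair can be loaded and inspected like this (a minimal sketch; using pandas is an illustrative choice, not a repo requirement):

```python
# Minimal sketch: load the benchmark and look at one sentence pair.
import pandas as pd

wq = pd.read_csv("data/winoqueer_final.csv")
print(wq.columns.tolist())  # Gender_ID_X, Gender_ID_Y, sent_x, sent_y

row = wq.iloc[0]
print("target:", row["Gender_ID_X"], "| counterfactual:", row["Gender_ID_Y"])
print("sent_x:", row["sent_x"])
print("sent_y:", row["sent_y"])
```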

To run WinoQueer evaluation on your own language model, use code/metric.py for masked LMs and code/metric_autoregressive.py for autoregressive LMs.

Usage: python metric.py (or metric_autoregressive.py) --input_file <path to winoqueer_final.csv> --lm_model_path <path to model directory> --output_file <path to CSV for detailed output> --summary_file <path to file for summary output (optional)>

The evaluation scripts are forked from the CrowS-Pairs evaluation script (https://github.com/nyu-mll/crows-pairs); metric_autoregressive.py was additionally modified to support autoregressive models.
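
For intuition, the comparison behind the metric looks roughly like the sketch below: each sentence in a pair gets a pseudo-log-likelihood score from the masked LM, and the benchmark reports how often the model prefers the biased sentence. This is a simplification, not the exact metric.py implementation (which, following CrowS-Pairs, scores only the tokens shared between the two sentences); the model name is a placeholder.

```python
# Simplified CrowS-Pairs-style comparison for a masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence):
    """Sum of log P(token | rest of sentence), masking one token at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# The benchmark counts how often the model prefers the biased sentence.
biased = pseudo_log_likelihood("<sent_x from the benchmark>")
counterfactual = pseudo_log_likelihood("<sent_y from the benchmark>")
print("model prefers biased sentence:", biased > counterfactual)
```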

Finetuning Data

data/article_urls.csv: metadata and URLs of the news articles used to finetune models. Due to licensing requirements, we are not allowed to share the full text of the articles, but you can "rehydrate" the URLs using tools from Media Cloud: https://mediacloud.org/open-source.

Once rehydrated, segment sentences with code/preproc/segment_articles.py, then train a model with one of code/finetune_*.py.
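
A minimal sketch of the segmentation step (the actual segment_articles.py may differ; nltk and the file names here are illustrative assumptions):

```python
# Split each rehydrated article into one sentence per line.
import nltk

nltk.download("punkt", quiet=True)

with open("articles.txt") as fin, open("sentences.txt", "w") as fout:
    for article in fin:
        for sent in nltk.sent_tokenize(article.strip()):
            fout.write(sent + "\n")
```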

data/tweetIDs.csv.zip: tweet IDs for the Twitter data used to finetune models. Tweet IDs must be "rehydrated" using the Twitter API before they can be used, and are provided for non-commercial research purposes only. The file is compressed due to its size, so decompress it before use.
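
A minimal sketch for reading the ID list (the member name inside the archive is an assumption; if the file is actually gzip-compressed rather than a zip archive, use gzip.open instead):

```python
# Read the tweet IDs out of the compressed archive.
import zipfile

with zipfile.ZipFile("data/tweetIDs.csv.zip") as zf:
    with zf.open("tweetIDs.csv") as f:
        tweet_ids = [line.decode().strip() for line in f]

print(f"{len(tweet_ids)} tweet IDs to rehydrate")
```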

Finetuning Scripts

Preprocessing

code/preproc/segment_articles.py: script for sentence-segmenting news articles.

code/TweetNormalizer.py: script for normalizing tweets. It is called from finetune_*.py, so there is no need to run it as a separate preprocessing step. Fork of the BERTweet tweet normalizer: https://github.com/VinAIResearch/BERTweet
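
For illustration, the normalizer can also be called directly (assuming you run from the code/ directory; in the upstream BERTweet implementation, mentions are mapped to @USER and URLs to HTTPURL):

```python
# Normalize a raw tweet before feeding it to the model.
from TweetNormalizer import normalizeTweet

print(normalizeTweet("@user Check this out!! https://example.com"))
# Expected shape of output: "@USER Check this out ! ! HTTPURL"
```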

Training

code/ds_config_general.json: DeepSpeed configuration file.

code/finetune_model.py: Used to finetune all versions of BERT, RoBERTa, and ALBERT.

Usage: python finetune_model.py <path to model dir> <path to training data> {'n' for news, 't' for twitter}
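
For reference, masked-LM finetuning along these lines looks roughly like the sketch below; the model name, data file, and hyperparameters are illustrative assumptions, not the script's actual settings.

```python
# Sketch of masked-LM finetuning on one-sentence-per-line training data.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "sentences.txt"})
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```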

code/finetune_autoregressive.py: Used to finetune all versions of GPT2 and BLOOM (requires DeepSpeed).

Usage: deepspeed --num_gpus=<desired number of GPUs> finetune_autoregressive.py <path to model dir> <path to training data> {'n' for news, 't' for twitter} --deepspeed ds_config_general.json

DeepSpeed defaults to port 29500. If you want to launch two DeepSpeed training runs on the same machine, use --master_port=<something other than 29500> to avoid a port conflict, as in the example below.
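
For example, to launch a second run while another is already using the default port (paths are placeholders):

deepspeed --num_gpus=2 --master_port=29501 finetune_autoregressive.py <path to model dir> <path to training data> t --deepspeed ds_config_general.json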

code/finetune_opt_350m.py: Used to finetune opt-350m.

Usage: python finetune_opt_350m.py <path to model dir> <path to training data> {'n' for news, 't' for twitter}

General

code/requirements.txt: versioning information for Python packages. We used Python 3.9.12 with pip 22.0.4.
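
To reproduce the environment, install from the pinned requirements:

pip install -r code/requirements.txt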