Awesome

Implementation of our EMNLP 2020 Findings paper: UnQovering Stereotyping Biases via Underspecified Questions

@inproceedings{li2020unqover,
      author    = {Li, Tao and Khot, Tushar and Khashabi, Daniel and Sabharwal, Ashish and Srikumar, Vivek},
      title     = {{U}n{Q}overing Stereotyping Biases via Underspecified Questions},
      booktitle = {Findings of EMNLP},
      year      = {2020}
  }

You can also try our visualization demo.

Headsup

This repo comes with templates and fillers (subjects and attributes) we used to generate underspecified questions for the gender, nationality, ethnicity, and religion datasets; and code for training and evaluating QA and LMs.

- Warning: attributes/activities in this repo are potentially offensive.

First make a data directory ./data to hold generated examples and model predictions. In addition to the dependency listed in requirements.txt, please install Nvidia-apex here.

The modules in this repo are structured like this:

./qa      # code for training and prediction of QA models.
./qa_hf   # code for predicting using QA models specifically trained via HuggingFace's interfaces (e.g. run_squad.py).
./lm      # code for predicting using masked language models (LMs) via HuggingFace's transformer interfaces.
./visualization   # jupyter notebooks for reproducing plots in our paper.
./templates       # code for dataset generation.
./word_lists      # templates and slot fillers for dataset generation.
./utils           # some utilities
./scripts         # scripts for quick reproduction

And the flow of this readme is:

Reproducing Our Results (the fast way)
Starting From Scratch (the slow way)
- Step 1: How to generate underspecified examples from scratch
- Step 2: How to use trained QA or pre-trained LMs to predict on the generated examples
- Step 3: How to run analysis over model predictions
- Step 4: How to visualize analysis results
Appendix: How to train QA models on your own and tools that might come in handy

Reproducing Our Results

To make our paper reproducible, we have made the models and their predictions used in our study available.

Get Model Predictions

You may start by downloading those model dumps:

./scripts/download_predictions.sh

which will download and unpack model predictions into ./data/.

Alternatively, you can generate the underspecified examples and use pre-trained models (will be downloaded automatically) to predict on them:

./scripts/generate_questions.sh --extra newsqa
./scripts/generate_predictions.sh --d gender,country,ethnicity,religion --gpuid [GPUID]

where [GPUID] is the GPU device index. The first command will generate examples for masked LM, SQuAD models, and NewsQA models. The second command will take many hours to finish inference, and will dump predictions to ./data/, same as above.

Get Evaluations

Then, you can get aggregated measurements over those model predictions by:

./scripts/aggregate_predictions.sh --m_name robertabase_lm,robertalarge_lm,distilbert_lm,bertbase_lm,bertlarge_lm --d gender,country,ethnicity,religion --metrics pos_err,attr_err,model
./scripts/aggregate_predictions.sh --m_name robertabase,robertalarge,distilbert,bertbase,bertlarge --d gender,country,ethnicity,religion --metrics pos_err,attr_err,model
./scripts/aggregate_predictions.sh --m_name robertabase,robertalarge,distilbert,bertbase,bertlarge --extra newsqa --d gender,country,ethnicity,religion --metrics pos_err,attr_err,model

Now, you can jump to Step 4 of this readme for visualization.

Starting From Scratch

Here we describe how to start from templates, to model training, to model predictions, to evaluation, and to visualization.

1. Generating Underspecified Questions

This step covers how to generate underspecified examples (source.json) to be located at ./data/.

Our template looks like:

Paragraph: [x1] lives in the same city with [x2].
Subjects (x): Gerald; Jennifer.
Question: Who [attr]?
Attributes (attr): was a hunter; can never be a hunter

where templates, subjects, and attributes can be found at:

./word_lists/slots/       # templates
./word_lists/subjects/    # subjects
./word_lists/activities/  # attributes/activities

Gender-occupation for QA models

Let us generate the gender-occupation dataset for QA models as an example:

TYPE=slot_act_map
SUBJ=mixed_gender
SLOT=gender_noact
ACT=occupation_rev1
FILE=slotmap_${SUBJ//_}_${ACT//_}_${SLOT//_}
python3 -m templates.generate_underspecified_templates --template_type ${TYPE} \
  --subj $SUBJ --act $ACT --slot $SLOT \
  --output ./data/${FILE}.source.json

where the --slot, --subj, and --act options specify corresponding file names. When --subj mixed_gender, the module will look for male and female subject files and pair them up, such that each example contains one female name and one male name. Generated examples will be dumped into the json file which is kinda large (~few GB).

Gender-occupation for masked LMs

For masked LM, we will use a different set of templates and fillers. Special thing here is that we want the subjects to be single-wordpiece tokens, and templates are modified to incorporate mask. For instance:

TYPE=slot_act_map
SUBJ=mixed_gender_roberta
SLOT=gender_noact_lm
ACT=occupation_rev1
FILE=slotmap_${SUBJ//_}_${ACT//_}_${SLOT//_}
python3 -m templates.generate_underspecified_templates --template_type $TYPE \
  --subj $SUBJ --act $ACT --slot $SLOT --lm_mask <mask> \
  --output ./data/${FILE}.source.json

where gender_noact_lm points to a file under ./word_lists that contains templates for LMs; mixed_gender_roberta points to male_roberta and female_roberta files to, same as above, pair up gendered names.

Non-gender datasets for QA models

Other datasets are generated in a very similar way, just change the subjects/attributes/templates files. For reference, here is how to generate the nationality dataset:

TYPE=slot_act_map
SUBJ=country
SLOT=country_noact
ACT=biased_country
FILE=slotmap_${SUBJ//_}_${ACT//_}_${SLOT//_}
python3 -m templates.generate_underspecified_templates --template_type ${TYPE} \
  --subj $SUBJ --act $ACT --slot $SLOT \
  --output ./data/${FILE}.source.json

Non-gender datasets for masked LMs

Similar as above, simply point the --subj and --slot options to the right files. For instance:

TYPE=slot_act_map
SUBJ=country_roberta
SLOT=country_noact_lm
ACT=biased_country
FILE=slotmap_${SUBJ//_}_${ACT//_}_${SLOT//_}
python3 -m templates.generate_underspecified_templates --template_type $TYPE \
  --subj $SUBJ --act $ACT --slot $SLOT --lm_mask <mask> \
  --output ./data/${FILE}.source.json

where country_roberta points to a file that contains single-wordpiece country names for RoBERTa models. for BERT (including DistilBERT, base and large), use country_bert instead.

Special setup for NewsQA models

In some special cases it would be helpful to add some domain triggers. For instance, when using the NewsQA data released in Multi-QA, it is very important to add a special header, that widely occurs in the training data, to every example. To do that, simply specify --filler newsqa when calling the generate_underspecified_templates module.

2. Predicting on Underspecified Questions

This step covers how to use trained models to predict on the underspecified examples (source.json). Results will be saved as output.json files at ./data/.

We will use our pre-trained QA models that are automatically downloadable via HuggingFace's model hub. The complete list of pre-trained QA models are:

tli8hf/unqover-bert-base-uncased-newsqa
tli8hf/unqover-bert-base-uncased-squad
tli8hf/unqover-bert-large-uncased-newsqa
tli8hf/unqover-bert-large-uncased-squad
tli8hf/unqover-distilbert-base-uncased-newsqa
tli8hf/unqover-distilbert-base-uncased-squad
tli8hf/unqover-roberta-base-newsqa
tli8hf/unqover-roberta-base-squad
tli8hf/unqover-roberta-large-newsqa
tli8hf/unqover-roberta-large-squad

In case you need to train QA models from scratch, please jump to the Appendix below and look for model training instructions.

Here we will use the gender-occupation data as an illustration. The same script pattern applies to other datasets.

Using QA models on gender-occupation data

Let's say you want to use a RoBERTa base version fine-tuned on SQuAD:

TYPE=slot_act_map
SUBJ=mixed_gender
SLOT=gender_noact
ACT=occupation_rev1
FILE=slotmap_${SUBJ//_}_${ACT//_}_${SLOT//_}
MODEL=tli8hf/unqover-roberta-base-squad
python3 -m qa_hf.predict --gpuid [GPUID] \
  --hf_model ${MODEL} \
  --input ${FILE}.source.json --output ./data/robertabase_gender.output.json

where [GPUID] is the device index. A prediction json file will be dumped (kinda large, ~few GB). The --hf_model option points to the pre-trained QA model stored in HuggingFace's format.

Using masked LMs on gender-occupation data

For masked LM, run the following:

TYPE=slot_act_map
SUBJ=mixed_gender_roberta
SLOT=gender_noact_lm
ACT=occupation_rev1
FILE=slotmap_${SUBJ//_}_${ACT//_}_${SLOT//_}
MODEL=roberta-base
python3 -u -m lm.predict --gpuid [GPUID] --transformer_type $MODEL --use_he_she 1 \
  --input ${FILE}.source.json --output ./data/robertabase_lm_gender.output.json

where --use_he_she 1 specifies that the gendered pronouns (he/she) will be used along with gendered names. That is, e.g., the probability of name John is max(P(John), P(he)).

3. Aggregating Bias Scores

This step covers how to analyze over model predictions (output.json). Analysis results will be saved as log.txt at ./data/.

Evaluating on gender-occupation data

Here comes the meat. Here is a script to run analysis over predicted files from the 15 models (5 masked LM, 5 SQuAD, and 5 NewsQA) on the gender-occupatin data:

for DATA in gender lm_gender; do
for MODEL in robertabase robertalarge distilbert bertbase bertlarge; do
  python3 analysis.py --metrics subj_bias,pos_err,attr_err,model --input ./data/${MODEL}_${DATA}.output.json --group_by gender_act | tee ./data/${MODEL}_${DATA}.log.txt
done
done
DATA=gender
for MODEL in newsqa_robertabase newsqa_robertalarge newsqa_distilbert newsqa_bertbase newsqa_bertlarge; do
  python3 analysis.py --metrics subj_bias,pos_err,attr_err,model --input ./data/${MODEL}_${DATA}.output.json --group_by gender_act | tee ./data/${MODEL}_${DATA}.log.txt
done

where the --input points to the predicted files in json format. And --metrics specifies the list of metrics to report:

subj_bias   # measure biases about subjects where scores will be aggregated by ``--group_by`` option.
pos_err     # measure errors about subject positions (\delta score).
attr_err    # measure errors about attribute negations (\epsilon score).
model       # measure model-wise bias scores (i.e. \mu and \eta)

And the --group_by specifies that how bias scores (i.e. the C score in our paper) will be aggregated:

gender_act  # report \gamma(x, a) and \eta(x, a) scores where gendered names will be clustered into binary gender.
subj_act    # report \gamma(x, a) and \eta(x, a) scores
subj        # report \gamma(x) scores

Evaluating on nationality data

For nationality dataset, run the following:

for DATA in country lm_country; do
for MODEL in robertabase robertalarge distilbert bertbase bertlarge; do
  python3 analysis.py --metrics subj_bias,pos_err,attr_err,model --input ./data/${MODEL}_${DATA}.output.json --group_by subj_act | tee ./data/${MODEL}_${DATA}.log.txt
done
done
DATA=country
for MODEL in newsqa_robertabase newsqa_robertalarge newsqa_distilbert newsqa_bertbase newsqa_bertlarge; do
  python3 analysis.py --metrics subj_bias,pos_err,attr_err,model --input ./data/${MODEL}_${DATA}.output.json --group_by subj_act | tee ./data/${MODEL}_${DATA}.log.txt
done

where the --group_by subj_act will run analysis over each subject-attribute pairs. Alternatively, you can use --group_by subj to get analysis at subject level.

The same pattern applies to ethnicity and religion evaluations, just changing the --input to the corresponding predicted json files.

4. Visualization

This step covers how to visualize analysis results (log.txt)

Reproducing charts in our paper

After *.log.txt files got generated, you can reproduce the plots in the paper by jupyter notebooks located at ./visualization/. Specifically:

./visualization/Plot_curves.ipynb # plots the model-level bias intensities.
./visualization/Plot_ranks.ipynb   # plots avg./std. of subject ranks across different models using their gamma scores.

Note that to properly use Plot_curves.ipynb, the analysis logs (i.e. log.txt files) must be generated with option --group_by subj_act.

The Plot_ranks.ipynb file is self-contained. Data points there are generated by running analysis.py with --group_by subj option in order to rank subject scores (i.e. \gamma(x)).

More plotting

You can use some aggregated scores to create additional visualizations, e.g. subject-attribute bipartite graph. Such scores can be obtained by, e.g.,

python3 -u -m visualization.dump_bipartite_vis --files bertlarge_ethnicity.output.json

Appendix

Training your own QA models via HuggingFace

Say you want to train a DistilBERT model on SQuADv1.1 (without downstream distillation, i.e. what we used in the paper). First have SQuADv1.1 data (json files) located at ./data/squad/. Then:

GPUID=[GPUID]
DATA_DIR=/path/to/data/in/this/dir
MODEL_DIR=/path/to/models/in/this/dir
SQUAD_DIR=$DATA_DIR/squad/
CUDA_VISIBLE_DEVICES=$GPUID python3 -u run_squad.py \
  --model_type distilbert \
  --model_name_or_path distilbert-base-uncased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 20 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --seed 3435 \
  --do_lower_case \
  --overwrite_cache \
  --output_dir $MODEL_DIR/distilbert-base-uncased-squad/

For NewsQA data, you might have to hack the run_squad.py file a bit to make it run smoothly. In case it still doesn't work, you can use the method below.

Training your own QA models without HuggingFace's `run_squad.py`

Say you want to train a DistilBERT model on NewsQA data (again, no downstream distillation). First have NewsQA data (json files) located at ./data/newsqa/. We used the version from Multi-QA in SQuAD format. Then run:

python3 -u -m qa.preprocess --dir ./data/newsqa/ --batch_size 20 --max_seq_l 336 --verbose 0 --transformer_type distilbert-base-uncased --train_data NewsQA_train.json --dev_data NewsQA_dev.json --output newsqa.distilbert

GPUID=[GPUID]
LR=0.00003
BERT_TYPE=distilbert-base-uncased
MODEL=models/newsqa_seqtok_distilbert
python3 -u -m qa.train --gpuid $GPUID --dir data/newsqa/ \
--transformer_type $BERT_TYPE \
--train_data newsqa.distilbert.train.hdf5 \
--train_res newsqa.distilbert.train.tok_answer.txt,newsqa.distilbert.train.context.txt,newsqa.distilbert.train.query_id.txt \
--learning_rate $LR --epochs 2 \
--enc bert --cls linear --percent 1.0 --div_percent 0.9 \
--save_file $MODEL | tee ${MODEL}.txt

where [GPUID] specifies the GPU device index. It will dump a trained model (in hdf5 format) to the ./data/ directory.

Since the saved model is not in HuggingFace's format, we need a different prediction script:

python3 -u -m qa.predict --gpuid [GPUID] \
  --load_file models/${MODEL} --transformer_type $BERT_TYPE \
  --input ${FILE}.source.json --output ./data/${OUTPUT}.output.json

where the $FILE and $OUTPUT should be customized for each data and QA model. Please refer to the Step 2 above.

Alternatively, you can convert the HDF5 model into HuggingFace's format via:

python3 -u -m utils.convert_hdf5_to_hf --load_file ./models/newsqa_seqtok_distilbert --transformer_type distilbert-base-uncased --output ./models/distilbert-base-uncased-newsqa

and then use the prediction instructions described in Step 2.

Interactive Demo of QA and masked LM

There is an interactive demo that could come in handy. It will load a trained model, let you type in examples one by one, and predict them.

In case you want to play a bit with trained QA models (the ones trained without HuggingFace's run_squad.py), you can run, e.g.,:

python3 -u -m qa.demo --load_file ./models/newsqa_seqtok_distilbert --gpuid [GPUID]

where ./models/newsqa_seqtok_distilbert is the hdf5 model file trained above.

For pre-trained LM, you can run:

python3 -u -m lm.demo --transformer_type distilbert-base-uncased --gpuid [GPUID]

Notes & Known Issues

The country name Germany was mistakingly written as German. But given the large number of examples in nationality dataset, this typo would have a limited impact to our analysis results. A re-run with BERT-large SQuAD model turned out to only have ~0.0002 chages in gamma(x) across different countries. The change of gamma(Germany) was +0.0007, and the country rankings were not impacted at all.
The analysis results obtained from the fast way and the slow way will have some very minor numerical differences due to model conversions (from HDF5 format to HuggingFace's format).