nlg-bias
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng (EMNLP 2019).
Data
regard/ contains samples annotated with regard, and sentiment/ contains samples annotated with sentiment.
The train_other.tsv files include samples with the "other" label. We use train.tsv to train the original models described in the paper, but also include the more robust models trained with train_other.tsv in this repo.
In the TSV files, the first column is the annotation (-1 for negative, 0 for neutral, 1 for positive, 2 for other), and the second column is the sample.
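The two-column layout described above can be read with the standard library alone. The following is a minimal sketch, assuming the files have no header row and no quoted fields; the helper names (`read_annotated_tsv`, `label_counts`), the path mentioned in the comment, and the inline example rows are hypothetical and not part of this repo.

```python
import csv
import io
from collections import Counter

# Label meanings as described for the TSV files.
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive", 2: "other"}


def read_annotated_tsv(handle):
    """Yield (label, sample) pairs from a two-column, tab-separated file.

    Assumes no header row: the first column must parse as an integer.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield int(row[0]), row[1]


def label_counts(pairs):
    """Count how many samples carry each label name."""
    return Counter(LABEL_NAMES[label] for label, _ in pairs)


# Hypothetical in-memory rows; for real data, pass an open file instead,
# e.g. open("data/regard/train_other.tsv", encoding="utf-8") (assumed path).
example = io.StringIO(
    "1\tXYZ was known for being a great teacher.\n"
    "-1\tXYZ was described as a thief.\n"
    "2\tXYZ was\n"
)
pairs = list(read_annotated_tsv(example))
print(label_counts(pairs))
```

Counting labels this way is a quick check of how often the "other" label occurs in train_other.tsv.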
For more details on annotation guidelines and process, please look through the paper.
Models
Update (Dec. 2020): Using more annotated data collected for our work Towards Controllable Biases in Language Generation, we have trained an updated regard classifier (1GB) that you can download into models/.
This model consists of one BERT classifier (no ensemble) trained on 1.7K samples and has a (new) dev set acc. of 0.80 and a (new) test set acc. of 0.84. The model's accuracy on the (old) test set is 0.87 (comparable with the test set of the ensemble models below). This model can be run with the following command:
python scripts/run_classifier.py \
--data_dir data/regard \
--model_type bert \
--model_name_or_path models/bert_regard_v2_gpt2/checkpoint-300 \
--output_dir models/bert_regard_v2_gpt2 \
--max_seq_length 128 \
--do_predict \
--test_file [TEST_FILE] \
--do_lower_case \
--per_gpu_eval_batch_size 32 \
--model_version 2
Older Models: Each of these models is an ensemble of three BERT models and can be run with scripts/eval.py as detailed below.
- Download the regard2 model here (3.12 GB) into models/.
- Download the regard1 model here (3.12 GB) into models/.
- Download the sentiment2 model here (3.12 GB) into models/.
- Download the sentiment1 model here (3.12 GB) into models/.
There are four types of models: regard1, regard2, sentiment1, and sentiment2. All are ensemble models that take the majority label of three model runs. regard1 and sentiment1 are trained on the respective train.tsv files (as described in the paper). regard2 and sentiment2 are trained on the respective train_other.tsv files. We recommend using regard2 and sentiment2, as they appear to be more quantitatively and qualitatively robust.
| model_type | dev acc. | test acc. |
|---|---|---|
| regard2 | 0.92 | 0.80 |
| regard1 | 0.85 | 0.77 |
| sentiment2 | 0.87 | 0.77 |
| sentiment1 | 0.77 | 0.77 |
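To illustrate what "the majority label of three model runs" means, here is a minimal sketch of majority voting over per-sample predictions. It is not the code in scripts/eval.py: the function names and the example predictions are hypothetical, and the tie-breaking rule (first label seen wins) is an assumption made only for this sketch.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among the votes.

    If every vote differs, the first vote wins; this tie-break is an
    arbitrary choice for the sketch, not necessarily what the repo does.
    """
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_model_predictions):
    """Combine one prediction list per model into one label per sample."""
    return [majority_label(votes) for votes in zip(*per_model_predictions)]


# Hypothetical predictions from three model runs on four samples.
run_a = [1, 0, -1, 2]
run_b = [1, -1, -1, 0]
run_c = [0, 0, -1, 2]
ensemble = ensemble_predict([run_a, run_b, run_c])
print(ensemble)  # [1, 0, -1, 2]
```

With three voters and four possible labels, a three-way disagreement is possible, so any real implementation needs some tie-breaking rule.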
Code
Setup
To create a clean environment and install necessary dependencies:
conda create -n biases python=3.7
conda activate biases
conda install pip
conda install pytorch=1.2.0 -c pytorch
pip install -r requirements.txt
Run models (using ensemble classifiers)
If we have a file of samples, e.g., small_gpt2_generated_samples.tsv, and a corresponding file where the demographic groups have been masked, small_gpt2_generated_samples.tsv.XYZ, we can run eval.py:
python scripts/eval.py --sample_file data/generated_samples/sample.tsv --model_type regard2
This will use the regard2 model to label all samples in sample.tsv and subsequently evaluate the amount of bias towards different demographic groups.
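This README does not spell out how eval.py quantifies bias. One plausible way to compare groups, sketched below, is to compute the share of each label per demographic group. The `(group, label)` input format, the group names, and the function name are all hypothetical and are not taken from eval.py.

```python
from collections import Counter, defaultdict

# Label meanings as used in the annotated TSV files.
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive", 2: "other"}


def label_distribution_by_group(records):
    """Map each group to the fraction of its samples per label name.

    `records` is an iterable of (group, label) pairs; this pairing is a
    hypothetical input format, not the one eval.py reads.
    """
    counts = defaultdict(Counter)
    for group, label in records:
        counts[group][LABEL_NAMES[label]] += 1
    return {
        group: {name: n / sum(c.values()) for name, n in c.items()}
        for group, c in counts.items()
    }


# Hypothetical classifier outputs for two placeholder groups.
records = [
    ("group_a", -1), ("group_a", 1), ("group_a", -1), ("group_a", 0),
    ("group_b", 1), ("group_b", 1), ("group_b", 0), ("group_b", -1),
]
dist = label_distribution_by_group(records)
for group, fractions in dist.items():
    print(group, fractions)
```

A large gap between groups in the negative or positive share would suggest that the generated text treats those groups differently.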