

BUG Dataset <img src="https://user-images.githubusercontent.com/6629995/132018898-038ec717-264d-4da3-a0b8-651b851f6b64.png" width="30" /><img src="https://user-images.githubusercontent.com/6629995/132017358-dea44bba-1487-464d-a9e1-4d534204570c.png" width="30" /><img src="https://user-images.githubusercontent.com/6629995/132018731-6ec8c4e3-12ac-474c-ae6c-03c1311777f4.png" width="30" />

A Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation (Levy et al., Findings of EMNLP 2021).

BUG was collected semi-automatically from several real-world corpora, and is designed to be challenging in terms of societal gender-role assignments for machine translation and coreference resolution.

Setup

  1. Unzip data.tar.gz; this should create a data folder with the following files:
    • balanced_BUG.csv
    • full_BUG.csv
    • gold_BUG.csv
  2. Set up a Python 3.x environment and install requirements:
pip install -r requirements.txt
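After step 1, the extracted layout can be sanity-checked with a small helper (hypothetical, not part of the repo):

```python
from pathlib import Path

# The three csv files expected in data/ after extracting data.tar.gz
EXPECTED = {"balanced_BUG.csv", "full_BUG.csv", "gold_BUG.csv"}

def missing_files(data_dir="data"):
    """Return the expected BUG csv files not found in data_dir, sorted by name."""
    path = Path(data_dir)
    present = {p.name for p in path.glob("*.csv")} if path.is_dir() else set()
    return sorted(EXPECTED - present)

print(missing_files())  # [] once the archive is extracted correctly
```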

Dataset Partitions

NOTE: These partitions vary slightly from those reported in the paper due to improvements and bug fixes made after submission. For reproducibility's sake, you can access the dataset from the submission here.

<img src="https://user-images.githubusercontent.com/6629995/132018898-038ec717-264d-4da3-a0b8-651b851f6b64.png" width="20" /> Full BUG

105,687 sentences with a human entity, identified by their profession and a gendered pronoun.

<img src="https://user-images.githubusercontent.com/6629995/132017358-dea44bba-1487-464d-a9e1-4d534204570c.png" width="20" /> Gold BUG

1,717 sentences, the gold-quality human-validated samples.

<img src="https://user-images.githubusercontent.com/6629995/132018731-6ec8c4e3-12ac-474c-ae6c-03c1311777f4.png" width="20" /> Balanced BUG

25,504 sentences, randomly sampled from Full BUG to ensure balance between male and female entities and between stereotypical and non-stereotypical gender role assignments.

Dataset Format

Each file in the data folder is a CSV file adhering to the following format:

| Column | Header | Description |
|--------|--------|-------------|
| 1 | sentence_text | Text of a sentence with a human entity, identified by their profession and a gendered pronoun |
| 2 | tokens | List of tokens (using the spaCy tokenizer) |
| 3 | profession | The entity in the sentence |
| 4 | g | The pronoun in the sentence |
| 5 | profession_first_index | Word offset of the profession in the sentence |
| 6 | g_first_index | Word offset of the pronoun in the sentence |
| 7 | predicted gender | 'male'/'female', determined by the pronoun |
| 8 | stereotype | -1/0/1 for anti-stereotypical, neutral, and stereotypical sentences |
| 9 | distance | The absolute distance in words between the pronoun and the profession |
| 10 | num_of_pronouns | Number of pronouns in the sentence |
| 11 | corpus | The corpus from which the sentence is taken |
| 12 | data_index | The query index of the pattern of the sentence |
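The schema can be sketched with a synthetic row (the sentence and all values below are invented for illustration, not drawn from BUG):

```python
import pandas as pd

# One illustrative row following the schema above
row = {
    "sentence_text": "The doctor asked the nurse to help her with the procedure.",
    "tokens": ["The", "doctor", "asked", "the", "nurse",
               "to", "help", "her", "with", "the", "procedure", "."],
    "profession": "doctor",
    "g": "her",
    "profession_first_index": 1,
    "g_first_index": 7,
    "predicted gender": "female",
    "stereotype": -1,   # anti-stereotypical gender-role assignment
    "distance": 6,      # |7 - 1| words between pronoun and profession
    "num_of_pronouns": 1,
    "corpus": "wikipedia",
    "data_index": 0,
}
df = pd.DataFrame([row])

# distance is the absolute word offset between pronoun and profession
assert (df["distance"] ==
        (df["g_first_index"] - df["profession_first_index"]).abs()).all()

# Select anti-stereotypical sentences with a female pronoun
anti = df[(df["stereotype"] == -1) & (df["predicted gender"] == "female")]
print(len(anti))
```

The same filtering applies unchanged after `pd.read_csv("data/gold_BUG.csv")` on a real partition.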

Evaluations

See below for instructions on reproducing our evaluations on BUG.

Coreference

  1. Download the SpanBERT predictions from this link.
  2. Unzip and put coref_preds.jsonl in the predictions/ folder.
  3. From src/evaluations/, run python evaluate_coref.py --in=../../predictions/coref_preds.jsonl --out=../../visualizations/delta_s_by_dist.png.
  4. This should reproduce the coreference evaluation figure.
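coref_preds.jsonl is a JSON-lines file (one JSON object per line); a minimal generic reader, assuming nothing about its field names, might look like:

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Tiny demo with a synthetic two-line file
Path("demo.jsonl").write_text('{"a": 1}\n{"a": 2}\n')
records = list(read_jsonl("demo.jsonl"))
print(records)  # [{'a': 1}, {'a': 2}]
```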

Conversions

CoNLL

To convert each data partition to CoNLL format run:

python convert_to_conll.py --in=path/to/input/file --out=path/to/output/file

For example, try:

python convert_to_conll.py --in=../../data/gold_BUG.csv --out=./gold_bug.conll

Filter from SPIKE

  1. Download the desired SPIKE CSV files and save them all in the same directory (directory_path).
  2. Make sure the name of each file ends with \_<corpus><x>.csv, where corpus is the name of the SPIKE dataset and x is the number of the query you entered in the search (for example: myspikedata_wikipedia18.csv).
  3. From src/evaluations/, run python Analyze.py directory_path.
  4. This should reproduce the full dataset and balanced dataset.
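The filename convention in step 2 can be checked with a small parser (a hypothetical helper, not part of the repo):

```python
import re

def parse_spike_filename(filename):
    """Extract (corpus, query_number) from a SPIKE export filename
    ending in _<corpus><x>.csv, e.g. myspikedata_wikipedia18.csv."""
    match = re.match(r".*_([A-Za-z]+)(\d+)\.csv$", filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    corpus, query = match.groups()
    return corpus, int(query)

print(parse_spike_filename("myspikedata_wikipedia18.csv"))  # ('wikipedia', 18)
```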

Citing

@misc{levy2021collecting,
      title={Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation}, 
      author={Shahar Levy and Koren Lazar and Gabriel Stanovsky},
      year={2021},
      eprint={2109.03858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}