Safety Score for Pre-Trained Language Models

Paper: An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models (ACL 2023, TrustNLP workshop)

This repository contains the code used to measure safety scores for pre-trained language models based on the human-annotated ToxiGen dataset and the ImplicitHate dataset.

Evaluation Dataset

Setup

There are a few specific dependencies to install before running the safety score calculator. You can install them with the command pip install -r requirements.txt.

How to calculate the safety score

Now you can run the following script:

# --data:   path to the evaluation dataset
# --output: local path to a directory for saving results
# --model:  pre-trained model name or local path
# --lmHead: type of language model head, i.e., causal (clm) or masked
# --force:  overwrites the output path if it already exists
python safety_score.py \
   --data data/toxiGen.json \
   --output results \
   --model gpt2 \
   --lmHead clm \
   --force

Two files will be saved in the output path.

For example, the content of 'safety_scores.json' after running the script above is:

{"asian": 0.3694922836054574, "black": 0.36662849289967936, "chinese": 0.3731038121619839, "jewish": 0.40661968642101093, "latino": 0.22831884057971014, "lgbtq": 0.2701839434577746, "mental dis": 0.22755361686659398, "mexican": 0.23524720893141945, "middle-eastern": 0.2604830744365628, "muslim": 0.32320982365959877, "native-american": 0.24511818257746595, "physical dis": 0.22460258469801234, "women": 0.23225019516003123}
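A results file like this is easy to summarize with a few lines of Python. The sketch below embeds a subset of the GPT2 scores shown above (rather than reading the file from disk) and reports the mean plus the highest- and lowest-scoring target groups; the variable names are illustrative, not part of the repository's API.

```python
import json

# Subset of the per-group safety scores from the GPT2 example above.
scores = {
    "asian": 0.3694922836054574,
    "black": 0.36662849289967936,
    "jewish": 0.40661968642101093,
    "women": 0.23225019516003123,
}

# Summarize: overall mean, plus the groups with the highest and lowest scores.
average = sum(scores.values()) / len(scores)
highest_group = max(scores, key=scores.get)
lowest_group = min(scores, key=scores.get)

print(f"average: {average:.3f}")
print(f"highest: {highest_group} ({scores[highest_group]:.3f})")
print(f"lowest:  {lowest_group} ({scores[lowest_group]:.3f})")
```

To use the actual output, replace the inline dict with `scores = json.load(open("results/safety_scores.json"))`.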

Safety scores based on ToxiGen

Here are the results based on the ToxiGen dataset:

| model name | Asian | Black | Chinese | Jewish | Latino | LGBTQ | Mentally disabled | Mexican | Middle-Eastern | Muslim | Native-American | Physically disabled | Women | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-large-uncased | 0.3904102 | 0.318049 | 0.385327 | 0.391747 | 0.248196 | 0.315275 | 0.260423 | 0.269784 | 0.30053 | 0.307303 | 0.254255 | 0.253674 | 0.243696 | 0.302975 |
| BERT-base-uncased | 0.3955331 | 0.332077 | 0.387988 | 0.394026 | 0.253957 | 0.314765 | 0.248967 | 0.273278 | 0.291169 | 0.302534 | 0.247724 | 0.244923 | 0.242808 | 0.302288 |
| DistilBERT-uncased | 0.4066471 | 0.324267 | 0.40219 | 0.406393 | 0.272203 | 0.272415 | 0.200269 | 0.2826 | 0.294716 | 0.289555 | 0.264996 | 0.218225 | 0.247609 | 0.298622 |
| MobileBERT | 0.3717289 | 0.319698 | 0.384602 | 0.405374 | 0.246391 | 0.286268 | 0.199057 | 0.266215 | 0.280596 | 0.300907 | 0.241644 | 0.218105 | 0.248078 | 0.289897 |
| BERT-large-cased | 0.3861499 | 0.294892 | 0.362991 | 0.340423 | 0.226696 | 0.296858 | 0.224227 | 0.245158 | 0.207529 | 0.251746 | 0.173039 | 0.217625 | 0.20645 | 0.264137 |
| BERT-base-cased | 0.3919012 | 0.316148 | 0.367058 | 0.355918 | 0.240072 | 0.311503 | 0.227047 | 0.256797 | 0.208023 | 0.272093 | 0.176547 | 0.224854 | 0.214208 | 0.274013 |
| DistilBERT-cased | 0.4032974 | 0.310421 | 0.395748 | 0.347781 | 0.272 | 0.27143 | 0.19779 | 0.298758 | 0.257318 | 0.211965 | 0.238203 | 0.207459 | 0.246604 | 0.281444 |
| RoBERTa-large | 0.4380718 | 0.385891 | 0.436398 | 0.42469 | 0.254029 | 0.294581 | 0.263915 | 0.265645 | 0.310878 | 0.281888 | 0.254456 | 0.26209 | 0.261524 | 0.318004 |
| RoBERTa-base | 0.4892215 | 0.447183 | 0.493185 | 0.49209 | 0.320232 | 0.343025 | 0.303185 | 0.352225 | 0.359769 | 0.353366 | 0.30507 | 0.311123 | 0.304411 | 0.37493 |
| DistilRoBERTa | 0.4971137 | 0.488124 | 0.489491 | 0.44293 | 0.363928 | 0.390325 | 0.364319 | 0.367339 | 0.419592 | 0.412908 | 0.35575 | 0.372084 | 0.356928 | 0.409295 |
| Electra-large-Generator | 0.3665474 | 0.293507 | 0.378886 | 0.366403 | 0.249174 | 0.295975 | 0.230296 | 0.277303 | 0.257767 | 0.283315 | 0.228314 | 0.23375 | 0.224053 | 0.283484 |
| Electra-base-Generator | 0.3703071 | 0.309711 | 0.376314 | 0.382847 | 0.254341 | 0.297005 | 0.219017 | 0.284024 | 0.270293 | 0.291083 | 0.233509 | 0.226641 | 0.228025 | 0.287932 |
| Electra-small-Generator | 0.390719 | 0.332936 | 0.417799 | 0.382365 | 0.271123 | 0.337894 | 0.244484 | 0.306524 | 0.285288 | 0.309288 | 0.253554 | 0.247908 | 0.253913 | 0.310292 |
| Albert-xxlarge-v2 | 0.4464272 | 0.409517 | 0.448182 | 0.484349 | 0.291833 | 0.338325 | 0.2682 | 0.314214 | 0.342889 | 0.321211 | 0.322392 | 0.302347 | 0.278864 | 0.351442 |
| Albert-xlarge-v2 | 0.4285448 | 0.404695 | 0.42712 | 0.471826 | 0.291812 | 0.374162 | 0.262406 | 0.313207 | 0.338421 | 0.329093 | 0.369698 | 0.275218 | 0.293628 | 0.352295 |
| Albert-large-v2 | 0.4749017 | 0.445774 | 0.465946 | 0.489712 | 0.325978 | 0.414326 | 0.33644 | 0.352111 | 0.384686 | 0.363161 | 0.387505 | 0.334824 | 0.324034 | 0.392262 |
| Albert-base-v2 | 0.472942 | 0.436361 | 0.476828 | 0.494453 | 0.342572 | 0.390925 | 0.305244 | 0.379035 | 0.370724 | 0.361862 | 0.35094 | 0.325473 | 0.316579 | 0.386457 |
| GPT2-xl | 0.3636664 | 0.366239 | 0.353361 | 0.401766 | 0.207203 | 0.271849 | 0.245597 | 0.213944 | 0.238641 | 0.31103 | 0.237301 | 0.231472 | 0.221868 | 0.281841 |
| GPT2-large | 0.3649977 | 0.363983 | 0.366992 | 0.402827 | 0.211116 | 0.279551 | 0.243361 | 0.220969 | 0.239988 | 0.311744 | 0.239372 | 0.233702 | 0.22743 | 0.285079 |
| GPT2-medium | 0.3636451 | 0.352714 | 0.362881 | 0.397167 | 0.21392 | 0.275893 | 0.236828 | 0.221197 | 0.232064 | 0.304091 | 0.233108 | 0.219603 | 0.226473 | 0.279968 |
| GPT2-small | 0.3694923 | 0.366628 | 0.373104 | 0.40662 | 0.228319 | 0.270184 | 0.227554 | 0.235247 | 0.260461 | 0.32321 | 0.245118 | 0.224603 | 0.23225 | 0.289445 |
| DistilGPT2 | 0.3853458 | 0.381619 | 0.383766 | 0.418747 | 0.243261 | 0.281941 | 0.23956 | 0.258183 | 0.287869 | 0.343128 | 0.259851 | 0.241207 | 0.227342 | 0.303986 |
| XLNet-large | 0.3846801 | 0.328298 | 0.378952 | 0.377031 | 0.267681 | 0.287548 | 0.226386 | 0.277208 | 0.238529 | 0.301164 | 0.235279 | 0.208874 | 0.23144 | 0.287928 |
| XLNet-base | 0.3841209 | 0.333978 | 0.381392 | 0.391181 | 0.281413 | 0.297107 | 0.216329 | 0.292739 | 0.244613 | 0.296866 | 0.231103 | 0.212123 | 0.234504 | 0.292113 |
| PTLMs Average | 0.4056839 | 0.360946 | 0.404021 | 0.411194 | 0.265727 | 0.31288 | 0.249621 | 0.284321 | 0.288431 | 0.309771 | 0.264114 | 0.251996 | 0.253863 | 0.312505 |
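The per-model averages in the table can be compared programmatically. The sketch below copies a few "Average" column values from the ToxiGen table above and ranks the models by score (ascending); the model selection here is arbitrary and for illustration only.

```python
# A few "Average" column values copied from the ToxiGen table above.
averages = {
    "BERT-large-uncased": 0.302975,
    "DistilRoBERTa": 0.409295,
    "GPT2-xl": 0.281841,
    "Albert-large-v2": 0.392262,
}

# Rank models by average safety score, lowest first.
ranked = sorted(averages.items(), key=lambda kv: kv[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```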

Safety scores based on ImplicitHate

Here are the results based on the ImplicitHate dataset:

| model name | Safety Score |
|---|---|
| BERT-large-uncased | 0.332300992 |
| BERT-base-uncased | 0.335931145 |
| DistilBERT-base-uncased | 0.336185856 |
| MobileBERT | 0.335289526 |
| BERT-large-cased | 0.300331164 |
| BERT-base-cased | 0.308677306 |
| DistilBERT-base-cased | 0.329417992 |
| RoBERTa-large | 0.353298215 |
| RoBERTa-base | 0.376362527 |
| DistilRoBERTa | 0.390526523 |
| ELECTRA-large-generator | 0.332349693 |
| ELECTRA-base-generator | 0.332561139 |
| ELECTRA-small-generator | 0.334555207 |
| ALBERT-xxlarge-v2 | 0.35294267 |
| ALBERT-xlarge-v2 | 0.358772426 |
| ALBERT-large-v2 | 0.352241738 |
| ALBERT-base-v2 | 0.339738782 |
| GPT-2-xl | 0.2539317 |
| GPT-2-large | 0.255463608 |
| GPT-2-medium | 0.255785509 |
| GPT-2 | 0.259990915 |
| DistilGPT-2 | 0.26304632 |
| XLNet-large-cased | 0.269394327 |
| XLNet-base-cased | 0.271851141 |

Citation

Please use the following to cite this work:

@misc{hosseini2023empirical,
      title={An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models}, 
      author={Saghar Hosseini and Hamid Palangi and Ahmed Hassan Awadallah},
      year={2023},
      eprint={2301.09211},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}