GPTScore: Evaluate as You Desire

This is the source code for the paper GPTScore: Evaluate as You Desire.

What is GPTScore?

GPTScore is a novel evaluation framework that utilizes the emergent abilities (e.g., zero-shot instruction) of Generative Pre-Trained models to Score generated texts.

<img src="./fig/framework.gif" width="800" class="center">
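As a rough illustration of the scoring idea, the sketch below treats the score of a generated text as the average log-probability that an evaluator model assigns to its tokens, given the prompt. The per-token probabilities are made-up numbers, the function name gpt_score is hypothetical, and uniform token weighting is an assumption of this sketch rather than a description of the repository's code.

```python
import math

def gpt_score(token_logprobs):
    """Average log-probability of the generated tokens (higher is better)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)

# Made-up probabilities an evaluator might assign to each token of two candidates
fluent = [math.log(0.5), math.log(0.4), math.log(0.6)]
awkward = [math.log(0.1), math.log(0.05), math.log(0.2)]

# The candidate whose tokens the evaluator finds more likely gets the higher score
print(gpt_score(fluent) > gpt_score(awkward))
```

In a real run the log-probabilities would come from the evaluator model itself, conditioned on the instruction, demonstrations, and source text.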

The GPTScore evaluation framework is:

  1. Customizable. Customized instructions and demonstrations enable the evaluation of new aspects without labeled datasets;
  2. Multifaceted. One evaluator performs multifaceted evaluations;
  3. Training-free.
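To make the "customizable" point concrete, here is a minimal sketch of how an instruction and optional demonstrations could be combined with a sample into a single prompt. The template wording, the build_prompt helper, and the example texts are all hypothetical; the templates actually used by the code may differ.

```python
def build_prompt(source, instruction=None, demos=()):
    """Assemble a prompt; the text to evaluate is scored as its continuation."""
    parts = []
    if instruction:
        parts.append(instruction)
    # Each demonstration is a (source, target) pair shown before the real sample
    for demo_source, demo_target in demos:
        parts.append(f"Source: {demo_source}\nText: {demo_target}")
    parts.append(f"Source: {source}\nText:")
    return "\n\n".join(parts)

prompt = build_prompt(
    source="name[Aromi], food[Italian]",
    instruction="Generate a fluent sentence for the following data.",
    demos=[("name[Zizzi], food[Thai]", "Zizzi serves Thai food.")],
)
print(prompt)
```

Changing the instruction text is what lets the same evaluator target a different aspect; omitting the instruction or the demonstrations gives the reduced prompt variants.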

What PLMs does GPTScore support?

We explored 19 pre-trained language models (PLMs), ranging in size from 80M (FLAN-T5-Small) to 175B (GPT3) parameters, to design GPTScore. The PLMs studied in the paper are listed below:

| Model | Parameters | Evaluator Name | Model | Parameters | Evaluator Name |
|---|---|---|---|---|---|
| GPT3 | | | OPT | | |
| text-ada-001 | 350M | gpt3_score | OPT350M | 350M | opt350m_score |
| text-babbage-001 | 1.3B | gpt3_score | OPT-1.3B | 1.3B | opt1_3B_score |
| text-curie-001 | 6.7B | gpt3_score | OPT-6.7B | 6.7B | opt6_7B_score |
| text-davinci-001 | 175B | gpt3_score | OPT-13B | 13B | opt13B_score |
| text-davinci-003 | 175B | gpt3_score | OPT-66B | 66B | opt66B_score |
| FLAN-T5 | | | GPT2 | | |
| FT5-small | 80M | flan_small_score | GPT2-M | 355M | gpt2_medium_score |
| FT5-base | 250M | flan_base_score | GPT2-L | 774M | gpt2_large_score |
| FT5-L | 770M | flan_large_score | GPT2-XL | 1.5B | gpt2_xl_score |
| FT5-XL | 3B | flan_xl_score | GPT-J-6B | 6B | gptJ6B_score |
| FT5-XXL | 11B | flan_xxl_score | | | |
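The evaluator names in the table appear to be used as command-line switches (the usage section passes opt350m_score and gpt3_score this way). A small lookup such as the one below, which simply transcribes the non-GPT3 rows of the table, can help pick an evaluator that fits a parameter budget. The largest_under helper is hypothetical and not part of the repository.

```python
# Parameter counts in millions, transcribed from the table above. GPT3 models are
# left out because they all share the single gpt3_score evaluator name.
EVALUATORS = {
    "flan_small_score": 80,
    "flan_base_score": 250,
    "flan_large_score": 770,
    "flan_xl_score": 3_000,
    "flan_xxl_score": 11_000,
    "gpt2_medium_score": 355,
    "gpt2_large_score": 774,
    "gpt2_xl_score": 1_500,
    "gptJ6B_score": 6_000,
    "opt350m_score": 350,
    "opt1_3B_score": 1_300,
    "opt6_7B_score": 6_700,
    "opt13B_score": 13_000,
    "opt66B_score": 66_000,
}

def largest_under(budget_millions):
    """Return the evaluator name with the most parameters within the budget."""
    fitting = {name: size for name, size in EVALUATORS.items() if size <= budget_millions}
    if not fitting:
        raise ValueError("no evaluator fits this budget")
    return max(fitting, key=fitting.get)

print(largest_under(1_000))  # gpt2_large_score
```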

Usage

Use the GPT3-based model as the evaluator

Here, we take the GPT3 text-curie-001 model as the example evaluator.

1. GPTScore with Instruction and Demonstration

Set both use_demo and use_ist to True.

python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo True \
  --use_ist True \
  --gpt3_score True \
  --gpt3model "curie" \
  --out_dir_name "gpt3Score_based" \
  --aspect 'quality'

2. GPTScore with only Instruction

Set use_ist to True and use_demo to False.

python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist True \
  --gpt3_score True \
  --gpt3model "curie" \
  --out_dir_name "gpt3Score_based" \
  --aspect 'quality'

3. GPTScore with neither Instruction nor Demonstration

Set both use_ist and use_demo to False.

python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist False \
  --gpt3_score True \
  --gpt3model "curie" \
  --out_dir_name "gpt3Score_based" \
  --aspect 'quality'

Use the non-GPT3-based model (e.g., OPT) as the evaluator

Here, we take the OPT350M model as the example evaluator.

1. opt350m_score with Instruction and Demonstration

Set both use_demo and use_ist to True.

python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo True \
  --use_ist True \
  --opt350m_score True \
  --out_dir_name "optScore_based" \
  --aspect 'quality'

2. opt350m_score with only Instruction

Set use_ist to True and use_demo to False.

python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist True \
  --opt350m_score True \
  --out_dir_name "optScore_based" \
  --aspect 'quality'

3. opt350m_score with neither Instruction nor Demonstration

Set both use_ist and use_demo to False.

python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist False \
  --opt350m_score True \
  --out_dir_name "optScore_based" \
  --aspect 'quality'
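The commands in the usage section differ only in the evaluator switch and the two boolean flags. The sketch below builds the argument list for each of the three settings so they can be run in a loop. It uses only flags that appear in the commands above; the build_command helper itself is hypothetical, and the script prints the commands rather than executing them.

```python
import shlex

def build_command(evaluator_flag, out_dir, use_ist, use_demo,
                  dataname="BAGEL", aspect="quality", gpt3model=None):
    """Return the argument list for one score_d2t.py run."""
    cmd = [
        "python", "score_d2t.py",
        "--dataname", dataname,
        "--use_demo", str(use_demo),
        "--use_ist", str(use_ist),
        f"--{evaluator_flag}", "True",
    ]
    # In the commands above, only the gpt3_score runs pass a model name
    if gpt3model is not None:
        cmd += ["--gpt3model", gpt3model]
    cmd += ["--out_dir_name", out_dir, "--aspect", aspect]
    return cmd

# (use_ist, use_demo) for the three settings described above
SETTINGS = [(True, True), (True, False), (False, False)]

for use_ist, use_demo in SETTINGS:
    cmd = build_command("opt350m_score", "optScore_based", use_ist, use_demo)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Calling build_command("gpt3_score", "gpt3Score_based", use_ist, use_demo, gpt3model="curie") produces the GPT3 variants in the same way.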

Bib

@article{fu2023gptscore,
  title={GPTScore: Evaluate as You Desire},
  author={Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei},
  journal={arXiv preprint arXiv:2302.04166},
  year={2023}
}