Awesome

GPTScore: Evaluate as You Desire

This is the Source Code of Paper: GPTScore: Evaluate as You Desire.

What is GPTScore?

GPTScore is a novel evaluation framework that utilizes the emergent abilities (e.g., zero-shot instruction) of Generative Pre-Trained models to Score generated texts.

GPTScore evaluation framework support:

Customizable. Customized instructions and demonstrations enable the evaluation of new aspects without labeled datasets;
Multifaceted. One evaluator performs multifaceted evaluations;
Training-free.

What PLMs does GPTScore support?

We explored 19 Pre-trained Language Models (PLMs) ranging in size from 80M (FLAN-T5-Small) to 175B (GPT3) to design GPTScore. The PLMs studied in this paper are listed as follows:

Model	Parameter	Evaluator Name	Model	Parameter	Evaluator Name
GPT3			OPT
text-ada-001	350M	gpt3_score	OPT350M	350M	opt350m_score
text-babbage-001	1.3B	gpt3_score	OPT-1.3B	1.3B	opt1_3B_score
text-curie-001	6.7B	gpt3_score	OPT-6.7B	6.7B	opt6_7B_score
text-davinci-001	175B	gpt3_score	OPT-13B	13B	opt13B_score
text-davinci-003	175B	gpt3_score	OPT-66B	66B	opt66B_score
FLAN-T5			GPT2
FT5-small	80M	flan_small_score	GPT2-M	355M	gpt2_medium_score
FT5-base	250M	flan_base_score	GPT2-L	774M	gpt2_large_score
FT5-L	770M	flan_large_score	GPT2-XL	1.5B	gpt2_xl_score
FT5-XL	3B	flan_xl_score	GPT-J-6B	6B	gptJ6B_score
FT5-XXL	11B	flan_xxl_score

Evaluator Name indicates the name of the evaluator corresponding to the Model name in the first column.

Usage

Use the GPT3-based model as the evaluator

Take the evaluation of GPT3-text-curie-001 model as an example.

Setting gpt3_score to True: the GPTScore evaluator uses a GPT3-based PLM.
Setting gpt3model to curie: the text-curie-001 model is utilized.
out_dir_name: set the folder for saving scoring results.
dataname: set the dataset name for evaluation (e.g., BAGEL).
aspect: set the aspect name to be evaluated (e.g., quality).

1. GPTScore with Instruction and Demonstration

Set both the use_demo and use_ist as True.

python score_d2t.py 
--dataname "BAGEL" 
--use_demo True 
--use_ist True 
--gpt3_score True 
--gpt3model "curie" 
--out_dir_name "gpt3Score_based"  
--aspect 'quality'

2. GPTScore with only Instruction

Set the use_ist to True and use_demo to False.

python score_d2t.py 
--dataname "BAGEL" 
--use_demo False 
--use_ist True 
--gpt3_score True 
--gpt3model "curie" 
--out_dir_name "gpt3Score_based"  
--aspect 'quality'

3. GPTScore without both Instruction and Demonstration

Set the use_ist to False and use_demo to False.

python score_d2t.py 
--dataname "BAGEL" 
--use_demo False 
--use_ist False 
--gpt3_score True 
--gpt3model "curie" 
--out_dir_name "gpt3Score_based"  
--aspect 'quality'

Use the non-GPT3-based model (e.g., OPT) as the evaluator

Here, we take the evaluation of OPT350M model as an example.

Setting opt350m_score to True: use the evaluator named opt350m_score.
out_dir_name: set the folder for saving scoring results.
dataname: set the dataset name for evaluation (e.g., BAGEL).
aspect: set the aspect name to be evaluated (e.g., quality).

1. `opt350m_score` with Instruction and Demonstration