Auto-evaluation for Korean Chat by OpenAI API
ChatGPT evaluates the model's Korean conversation skills!
Update Log
- 2024-07-16: 1) Improved the evaluation prompts. 2) Added a hallucination-evaluation (BETA) feature.
  - Removed prompts that ask for URLs.
  - Corrected unnatural instructions.
  - Switched the evaluator from GPT-4 Turbo to GPT-4o.
  - Re-evaluated the 6 models and added KULLM3-20240604 (not publicly released).
How to evaluate your model
Modify `oneclick_step1_step2.sh` and execute it.
- It automatically performs step1 (generate) and step2 (evaluation) using the ChatGPT API.
- It makes about 252 API calls of roughly 500 tokens each (roughly 126k tokens per run).
- Because it uses the ChatGPT API, it requires `export OPENAI_API_KEY=<your_api_key>` in bash.
Example
export OPENAI_API_KEY=<your_api_key>
sh oneclick_step1_step2.sh
Evaluation Results 1 (instruction-following ability)
- The mistralai/Mistral-7B-Instruct-v0.2 model is omitted because it frequently generates responses in English, even when the input is provided in Korean.
- The upstage/SOLAR-10.7B-Instruct-v1.0 model sometimes does the same; therefore, its coherence score is relatively low.
- Since GPT-4o evaluates itself, it is relatively likely to give itself a high score.
| Type | Model | Fluency (1 - 5) | Coherence (1 - 5) | Accuracy (1 - 5) | Completeness (1 - 5) | Overall Quality (1 - 5) |
|---|---|---|---|---|---|---|
| Closed | gpt-4o-2024-05-13 | 4.98 | 4.98 | 4.92 | 4.91 | 4.92 |
| Closed | gpt-3.5-turbo-0125 | 4.93 | 4.92 | 4.63 | 4.62 | 4.69 |
| Not released | KULLM3-20240604 | 4.88 | 4.86 | 4.48 | 4.52 | 4.58 |
| Open | KULLM3 | 4.87 | 4.83 | 4.46 | 4.49 | 4.54 |
| Open | SOLAR-10.7B-Inst | 3.87 | 3.55 | 3.58 | 3.57 | 3.44 |
| Open | KoAlpaca-1.1b | 4.06 | 3.57 | 2.72 | 2.67 | 2.83 |
Evaluation Results 2 (Hallucination-rejecting ability) (BETA)
- The vertical axis represents the hallucination rate (lower is better).
- The 'Global' category is the micro-average over all other categories (see the formula below).
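Assuming the standard definition of a micro-average (the repo does not spell it out), the Global rate pools every response across categories before dividing:

$$\text{Global hallucination rate} = \frac{\sum_{c} h_c}{\sum_{c} n_c}$$

where $h_c$ is the number of hallucinated responses and $n_c$ the number of evaluated responses in category $c$.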
How to Reproduce
Instruction-following ability evaluation
- Edit the contents of the `oneclick_step1_step2.sh` file to suit your purpose, and execute it.
Hallucination-rejecting ability evaluation
- Set `--eval_category=hallucination` instead of `--eval_category=chat`.
If you want to use pre-generated model answers and ChatGPT evaluations, you can remove the 'step1 (generate)' part and modify the 'step2 (api-evaluation)' part of `oneclick_step1_step2.sh` to disable the `--use_api` option. The `--use_api` option issues ChatGPT API calls and automatically saves the results.
Supported Models
- Models that require a model-specific generation code block:
  - nlpai-lab/kullm-polyglot-12.8b-v2
  - upstage/SOLAR-10.7B-Instruct-v1.0
  - mistralai/Mistral-7B-Instruct-v0.2 (excluded from the benchmark because it generates English even when given Korean prompts)
  - beomi/KoAlpaca-Polyglot-12.8B
- OpenAI models (gpt-3.5-turbo, gpt-4-turbo)
- Any other generative model with a default chat template (i.e. one that supports `tokenizer.apply_chat_template`), as sketched below
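For models in the last group, generation presumably follows the standard Hugging Face chat-template flow; the following is a minimal sketch (the model name and generation settings are placeholders, not taken from this repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-chat-model"  # placeholder: any model whose tokenizer ships a chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Build the prompt with the model's own chat template.
messages = [{"role": "user", "content": "한국의 수도는 어디인가요?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```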
Default Evaluation Template
In our experiments, a Korean-language evaluation form did not work very well, so we adopted an English form together with a system prompt telling the evaluator to be a Korean language expert.
System Prompt
You're a helpful assistant and a Korean language expert.
User Prompt
You will be given an evaluation instruction, an input, and an AI-generated response.
Your task is to rate the response on the given metrics.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.
Evaluation Criteria:
- Fluency (1-5): The quality of the language used in the response. A high-quality response should be grammatically correct, idiomatic, and free from spelling and punctuation errors.
- Coherence (1-5): A high score indicates that the response maintains consistent context. A low score is given if the response shifts context or language inappropriately from the instruction (e.g. the instruction is in Korean, but the response is in English).
- Accuracy (1-5): The correctness of the answer. The answer should be factually correct and directly answer the question asked.
- Completeness (1-5): The extent to which the response covers all aspects of the question. The response should not just address one part of the question, but should provide a comprehensive response.
- Overall Quality (1-5): The overall effectiveness and excellence of the response, integrating considerations of all above criteria.
Evaluation Steps:
1. Read the instruction and input carefully and understand what it is asking.
2. Read the AI-generated response and Evaluation Criteria.
3. Assign a score for each criterion on a scale of 1 to 5, where 1 is the lowest and 5 is the highest.
Instruction:
{instruction}
Input:
{input}
Response:
{response}
Evaluation Form (scores ONLY):
- Fluency (1-5):
- Coherence (1-5):
- Accuracy (1-5):
- Completeness (1-5):
- Overall Quality (1-5):
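The repo drives this template through the batched-chatgpt helper, but the call reduces to a standard chat-completions request. The sketch below is a simplified stand-in, not the repo's actual code (the score-parsing regex is a hypothetical helper); it shows how the filled-in form could be sent to the GPT-4o evaluator and the "scores ONLY" reply turned into numbers:

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You're a helpful assistant and a Korean language expert."

def evaluate(user_prompt_template: str, instruction: str, input_text: str, response: str) -> dict:
    """Send one filled-in evaluation form to GPT-4o and parse the returned scores."""
    user_prompt = user_prompt_template.format(
        instruction=instruction, input=input_text, response=response
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # the evaluator model; the repo pins a dated snapshot
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    reply = completion.choices[0].message.content
    # Hypothetical parser: pull "- Criterion (1-5): <number>" lines out of the reply.
    scores = {}
    for name, value in re.findall(r"-\s*([\w ]+) \(1-5\):\s*([0-9.]+)", reply):
        scores[name.strip()] = float(value)
    return scores
```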
Requirements
- torch
- transformers
- batched-chatgpt
- fire
- jsonlines
pip install torch transformers batched-chatgpt fire jsonlines
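As a quick sanity check before running the scripts (a minimal sketch; the import names are assumed from the pip package names), you can verify the dependencies and the API key:

```python
import importlib
import os

# Import names assumed to match the pip package names (hyphen -> underscore).
for module in ["torch", "transformers", "batched_chatgpt", "fire", "jsonlines"]:
    importlib.import_module(module)

# The evaluation step calls the ChatGPT API, so the key must be exported first.
assert os.environ.get("OPENAI_API_KEY"), "export OPENAI_API_KEY=<your_api_key> before running"
```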