Home

Awesome

Auto-evaluation for Korean Chat by OpenAI API

모델의 한국어 대화 능력을 ChatGPT가 평가합니다!
ChatGPT evaluates the model's Korean conversation skills!

Update Log

How to evaluate your model

Modify the oneclick_step1_step2.sh and execute it.

Example

export OPENAI_API_KEY=<your_api_key>
sh oneclick_step1_step2.sh

Evaluation Results 1 (instruction-following ability)

TypeModelFluency (0 - 5)Coherence (1 - 5)Accuracy (1 - 5)Completeness (1 - 5)Overall Quality (1 - 5)
Closedgpt-4o-2024-05-134.984.984.924.914.92
Closedgpt-3.5-turbo-01254.934.924.634.624.69
Not releasedKULLM3-202406044.884.864.484.524.58
OpenKULLM34.874.834.464.494.54
OpenSOLAR-10.7B-Inst3.873.553.583.573.44
OpenKoAlpaca-1.1b4.063.572.722.672.83
<p align="center"> <img src="assets/chat_evaluation.png" /> </p>

Evaluation Results 2 (Hallucination-rejecting ability) (BETA)

<p align="center"> <img src="assets/halluci_evaluation.png" /> </p>

How to Reproduce

Instruction-following ability evaluation

Hallucination-rejecting ability evaluation

If you want to use pre-generated model answers and ChatGPT evaluations,
you can remove 'step1(generate)' part and modify 'step2(api-evaluation)' part in oneclick_step1_step2.sh to disable the --use_api options.
--use_api option executes ChatGPT API call which auto-saves result.

Supported Models

Default Evaluation Template

Korean evaluation form wasn't pretty good, in our experiment result.
So we adopted english form and 'Be Korean language expert' system prompt.

System Prompt

You're a helpful assistant and a Korean language expert.

User Prompt

You will be given evaluation instruction, input and AI-generated response.
Your task is to rate the response on given metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:
- Fluency (1-5): The quality of the language used in the translation. A high-quality response should be grammatically correct, idiomatic, and free from spelling and punctuation errors.
- Coherence (1-5): A high score indicates that the response maintains consistent context. A low score is given if the response shifts context or language inappropriately from instruction(e.g. instruction's language is Korean, but response is English).
- Accuracy (1-5) - The correctness of the answer. The answer should be factually correct and directly answer the question asked
- Completeness (1-5) - The extent to which the response covers all aspects of the question. The response should not just address one part of the question, but should provide a comprehensive response.
- Overall Quality (1-5) - The overall effectiveness and excellence of the response, integrating considerations of all above criteria.

Evaluation Steps:
1. Read the instruction and input carefully and understand what it is asking.
2. Read the AI-generated response and Evaluation Criteria.
3. Assign a score for each criterion on a scale of 1 to 5, where 1 is the lowest and 5 is the highest.

Instruction:
{instruction}

Input:
{input}

Response:
{response}

Evaluation Form (scores ONLY):
- Fluency (1-5):
- Coherence (1-5):
- Accuracy (1-5):
- Completeness (1-5):
- Overall Quality (1-5):

Requirements

pip install torch transformers batched-chatgpt fire jsonlines