WenMind Benchmark

WenMind is a comprehensive benchmark dedicated to evaluating Large Language Models (LLMs) in Chinese Classical Literature and Language Arts (CCLLA). WenMind covers the sub-domains of Ancient Prose, Ancient Poetry, and Ancient Literary Culture, comprising 4,875 question-answer pairs spanning 42 fine-grained tasks (as shown in Figure 1), 3 question formats (fill-in-the-blank, multiple-choice, and question-and-answer), and 2 evaluation scenarios (domain-oriented and capability-oriented).

<p align="center"> <img src="https://github.com/SCUT-DLVCLab/WenMind/blob/main/Images/WenMind_Overall.png" width="800"/> <p> <p align="center"> <strong>Figure 1: Overview of WenMind Benchmark, which covers 3 sub-domains and 42 fine-gained tasks.</strong> <p>

Download

You can obtain the complete WenMind evaluation dataset from the WenMind Benchmark folder on GitHub.

Data Format

    {
      "id": 2464,
      "domain": "ancient literary culture",
      "capability": "knowledge",
      "question_format": "QA",
      "coarse_grained_task_zh": "成语",
      "coarse_grained_task_en": "idiom",
      "fine_grained_task_zh": "成语解释",
      "fine_grained_task_en": "idiom explanation",
      "question": "解释下面成语的意思:\n暮去朝来",
      "answer": "黄昏过去,清晨又到来。形容时光流逝。"
    }

The fields in each data sample are as follows:

- `id`: unique index of the question-answer pair.
- `domain`: the sub-domain the sample belongs to (ancient prose, ancient poetry, or ancient literary culture).
- `capability`: the capability evaluated in the capability-oriented scenario.
- `question_format`: one of the 3 question formats (e.g. `QA` for question-and-answer).
- `coarse_grained_task_zh` / `coarse_grained_task_en`: the coarse-grained task name in Chinese / English.
- `fine_grained_task_zh` / `fine_grained_task_en`: the fine-grained task name in Chinese / English.
- `question`: the question posed to the model.
- `answer`: the reference answer.
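
As a quick sketch of how the released JSON can be consumed (the file name `WenMind.json` below is an assumption; use the actual file in the WenMind Benchmark folder):

```python
import json

# Hypothetical file name; substitute the actual benchmark file from the repo.
with open("WenMind.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

print(len(samples))       # 4,875 question-answer pairs
print(samples[0].keys())  # the fields listed above

# Select all question-and-answer items of one fine-grained task.
idiom_qa = [s for s in samples
            if s["fine_grained_task_en"] == "idiom explanation"
            and s["question_format"] == "QA"]
print(len(idiom_qa), idiom_qa[0]["question"])
```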

Task List

- T1-1: Inverted Sentence Structure (倒装句语序)
- T1-2: Elliptical Sentence (省略句)
- T1-3: Inverted Sentence Types (倒装句类型)
- T1-4: Sentence Structure Identification (判断句式)
- T2: Classical Chinese to Modern Chinese (文白翻译)
- T3: Modern Chinese to Classical Chinese (白文翻译)
- T4: Named Entity Recognition (命名实体识别)
- T5: Punctuation (句读)
- T6: Topic Classification (主题分类)
- T7: Word Explanation (字词解释)
- T8: Reading Comprehension (阅读理解)
- T9: Function Words (虚词)
- T10: Homophones (通假字)
- T11: Polysemy (单字多义)
- T12: Classical Chinese Writing (文言文写作)
- T13-1: Appreciation Exam Questions (赏析真题)
- T13-2: Free Appreciation (自由赏析)
- T14-1: Poetry Writing (诗创作)
- T14-2: Ci Writing (词创作)
- T14-3: Qu Writing (曲创作)
- T15-1: Content Q&A (内容问答)
- T15-2: Title and Author Q&A (题目作者问答)
- T15-3: Write the Next Sentence (下句默写)
- T15-4: Write the Previous Sentence (上句默写)
- T15-5: Comprehension Dictation (理解性默写)
- T15-6: Genre Judgment (判断体裁)
- T16: Ancient Poetry Translation (古诗词翻译)
- T17: Sentiment Classification (情感分类)
- T18: Ancient Poetry to English (古诗词英文翻译)
- T19: Poet Introduction (诗人介绍)
- T20: Analysis of Imagery (意象解析)
- T21-1: Couplet Following (接下联)
- T21-2: Couplet Writing (主题创作)
- T21-3: HengPi Writing (拟横批)
- T22-1: Synonyms (近义词)
- T22-2: The Origin of Idiom (成语出处)
- T22-3: Idiom Finding (成语蕴含)
- T22-4: Idiom Explanation (解释含义)
- T23: Riddle (谜语)
- T24: Xiehouyu (歇后语)
- T25: Historical Chinese Phonology (古汉语音韵)
- T26: Knowledge of Sinology Q&A (国学常识问答)

Data Construction

The construction pipeline of WenMind includes data collection and data processing, as illustrated in Figure 2.

<p align="center"> <img src="https://github.com/SCUT-DLVCLab/WenMind/blob/main/Images/Data_Construction.png" width="550"/> <p> <p align="center"> <strong>Figure 2: Construction pipeline of WenMind Benchmark.</strong> <p>

Data Statistics

Table 1 provides the statistics of the WenMind dataset.

<div align="center">

Table 1: The statistics of the WenMind Benchmark. "Q" represents "Question" and "A" represents "Answer".

<table> <thead> <tr> <th align="left"><strong>Domain</strong></th> <th align="center"><strong>Tasks</strong></th> <th align="center"><strong>#Q</strong></th> <th align="center"><strong>Max. #Q</strong></th> <th align="center"><strong>Min. #Q</strong></th> <th align="center"><strong>Avg. Q Tokens</strong></th> <th align="center"><strong>Avg. A Tokens</strong></th> </tr> </thead> <tbody> <tr> <td align="left">Ancient Prose</td> <td align="center">15</td> <td align="center">1,900</td> <td align="center">200</td> <td align="center">7</td> <td align="center">107.51</td> <td align="center">62.12</td> </tr> <tr> <td align="left">Ancient Poetry</td> <td align="center">16</td> <td align="center">1,845</td> <td align="center">200</td> <td align="center">20</td> <td align="center">73.42</td> <td align="center">94.93</td> </tr> <tr> <td align="left">Ancient Literary Culture</td> <td align="center">11</td> <td align="center">1,130</td> <td align="center">100</td> <td align="center">100</td> <td align="center">26.68</td> <td align="center">14.26</td> </tr> <tr> <td align="left"><strong>Overall</strong></td> <td align="center">42</td> <td align="center">4,875</td> <td align="center">200</td> <td align="center">7</td> <td align="center">75.87</td> <td align="center">63.44</td> </tr> </tbody> </table> </div>
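
The per-domain counts in Table 1 can be re-derived from the released JSON. A rough sketch, under the same hypothetical file name as above (the Avg. Q/A token columns depend on the tokenizer used by the authors and are not reproduced here):

```python
import json
from collections import defaultdict

with open("WenMind.json", "r", encoding="utf-8") as f:  # hypothetical file name
    samples = json.load(f)

tasks_per_domain = defaultdict(set)
questions_per_domain = defaultdict(int)
for s in samples:
    tasks_per_domain[s["domain"]].add(s["fine_grained_task_en"])
    questions_per_domain[s["domain"]] += 1

for domain in tasks_per_domain:
    print(f"{domain}: {len(tasks_per_domain[domain])} tasks, "
          f"{questions_per_domain[domain]} questions")
print(f"Overall: {len(samples)} questions")
```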

Inference

a. Obtain the model’s responses

Open-source Model

For open-source models, we perform inference locally; only the model path and the output file path for the answers are required.

- `--model_path`: path to the model; defaults to loading from Hugging Face.
- `--output_path`: file path for the model's answer output; defaults to `{model_name}_result.json`.

e.g.

    CUDA_VISIBLE_DEVICES=0,1 python Evaluation_Code/Inference/Test_Baichuan2-7B-Chat.py \
        --model_path baichuan-inc/Baichuan2-7B-Chat \
        --output_path Baichuan2-7B-Chat_result.json
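
The released test scripts handle each model's own chat format and prompts; purely as an illustration of the local inference loop, here is a minimal sketch built on transformers. The benchmark path and the `llm_response` output field are assumptions, not the released script's exact interface.

```python
import argparse
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", default="baichuan-inc/Baichuan2-7B-Chat")
parser.add_argument("--output_path", default="Baichuan2-7B-Chat_result.json")
parser.add_argument("--benchmark_path", default="WenMind.json")  # hypothetical
args = parser.parse_args()

# Load tokenizer and model locally (device_map="auto" spreads layers over the visible GPUs).
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    args.model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model.eval()

with open(args.benchmark_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

results = []
for s in samples:
    inputs = tokenizer(s["question"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=512)
    # Keep only the newly generated tokens as the model's answer.
    response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    results.append({**s, "llm_response": response})  # hypothetical field name

with open(args.output_path, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```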

API Model

- For GPT-3.5 and GPT-4 models, provide two parameters: `api_base` and `api_key`.
- For ERNIE-3.5 and ERNIE-4.0 models, provide two parameters: `api_key` and `secret_key`.
- For Spark models, provide three parameters: `api_key`, `secret_key`, and `appid`.

Refer to the official documentation of each API model for details.

e.g.

    python Test_ERNIE-3.5-8K-0329.py \
        --API_KEY {api_key} \
        --SECRET_KEY {secret_key} \
        --output_path {output_path}
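
For the GPT-3.5/GPT-4 case, the request reduces to a standard chat-completion call. A minimal sketch using the openai package is shown below; the file names and the `llm_response` field are assumptions, not the released script.

```python
import json

from openai import OpenAI

# api_key and api_base correspond to the two parameters mentioned above.
client = OpenAI(api_key="YOUR_API_KEY", base_url="YOUR_API_BASE")

with open("WenMind.json", "r", encoding="utf-8") as f:  # hypothetical path
    samples = json.load(f)

results = []
for s in samples:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": s["question"]}],
    )
    results.append({**s, "llm_response": completion.choices[0].message.content})

with open("GPT-3.5_result.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```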

b. Use ERNIE-3.5 to score the responses

Step 1: Check whether the LLM response file is consistent with the format of the JSON/LLM_Response_Examples.json file.

Step 2: Open the `Evaluation_Code/LLM_Scoring.py` file and set the following:

- the `API_KEY` and `SECRET_KEY` for the scoring model ERNIE-3.5;
- `LLM_response_path`: the storage path of the LLM response file;
- `LLM_score_path`: the path where the scoring results will be saved;
- `LLM_prompt_path`: the storage path of `JSON/Task_Score_Prompt.json`.

Step 3: Run the following command to obtain the scoring results:

    python Evaluation_Code/LLM_Scoring.py
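
For orientation, the scoring step can be pictured as the loop sketched below. This is an assumption about the structure of `Evaluation_Code/LLM_Scoring.py`, not its actual code: the prompt-template keys and placeholders are hypothetical, and the ERNIE-3.5 call is left as a stub.

```python
import json

LLM_response_path = "Baichuan2-7B-Chat_result.json"   # the LLM response file
LLM_score_path = "Baichuan2-7B-Chat_score.json"       # where scores are saved
LLM_prompt_path = "JSON/Task_Score_Prompt.json"       # task-specific scoring prompts

with open(LLM_response_path, "r", encoding="utf-8") as f:
    responses = json.load(f)
with open(LLM_prompt_path, "r", encoding="utf-8") as f:
    score_prompts = json.load(f)  # assumed: task name -> scoring prompt template

def score_with_ernie(prompt: str) -> float:
    """Placeholder: replace with the ERNIE-3.5 chat API call authenticated
    via API_KEY / SECRET_KEY, and parse the score from its reply."""
    return 0.0

scored = []
for r in responses:
    template = score_prompts[r["fine_grained_task_en"]]      # assumed keying
    prompt = template.format(question=r["question"],         # assumed placeholders
                             answer=r["answer"],
                             response=r["llm_response"])
    scored.append({**r, "score": score_with_ernie(prompt)})

with open(LLM_score_path, "w", encoding="utf-8") as f:
    json.dump(scored, f, ensure_ascii=False, indent=2)
```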

c. Calculate the model’s score

Step 1: Check whether the scoring file is consistent with the format of the JSON/LLM_Score_Examples.json file.

Step 2: Open the Evaluation_Code/Calculate_Score.py file and replace LLM_score_path with the storage path of the scoring file.

Step 3: Run the following command to obtain the model's score:

    python Evaluation_Code/Calculate_Score.py
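
Score aggregation itself is straightforward; below is a rough sketch (not the released `Calculate_Score.py`; the numeric `score` field is an assumption) that averages scores per domain and per capability.

```python
import json
from collections import defaultdict

LLM_score_path = "Baichuan2-7B-Chat_score.json"  # hypothetical scoring file

with open(LLM_score_path, "r", encoding="utf-8") as f:
    scored = json.load(f)

def average_by(key):
    """Average the numeric score over every distinct value of the given field."""
    buckets = defaultdict(list)
    for s in scored:
        buckets[s[key]].append(float(s["score"]))
    return {k: sum(v) / len(v) for k, v in buckets.items()}

print("By domain:    ", average_by("domain"))
print("By capability:", average_by("capability"))
print("Overall:      ", sum(float(s["score"]) for s in scored) / len(scored))
```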

Evaluation Results

<p align="center"> <strong>Table 2: Results of all evaluated models on different domains and capabilities.</strong> <p> <p align="center"> <img src="https://github.com/SCUT-DLVCLab/WenMind/blob/main/Images/Evaluation_Result.png" width="750"/> <p>

Acknowledgement

License

The work is licensed under an MIT License.


The WenMind benchmark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.