Gaokao Benchmark for AI

<p align="center"> <img src="https://user-images.githubusercontent.com/59123869/173433076-de9036e2-3383-4670-b142-c5f9c27f54ed.png" width="500"/> </p>

Table of contents

<!--ts--> <!--te-->

What is Gaokao Benchmark

Gaokao Benchmark aims to track how much progress we are making toward human-level intelligence. It not only provides a comprehensive evaluation over tasks and domains that are practically useful in real-world scenarios, but also provides rich human performance data, so that AI systems can be directly compared with humans over time.

How to download Gaokao datasets

```bash
pip install --upgrade pip
pip install datalabs
```

```python
from datalabs import load_dataset
dataset = load_dataset("gaokao2018_np1", "listening")
```

where `gaokao2018_np1` specifies the dataset and `listening` specifies the question type (subdataset).

Different types of questions in Gaokao English are formatted differently. Below we give a detailed description of each; example questions can be found in INPUTS.md.

How to Evaluate your Gaokao AI system

Preprocess system output for each question type

We provide multiple ways to evaluate a Gaokao system. Before evaluation, system outputs for the different question types (i.e., subdatasets) must be converted into specific formats; see the examples in data/system_outputs.
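As a sketch of this preprocessing step, the helper below serializes raw model predictions into a JSON file. Note that the field name `predicted_answer` is an assumption for illustration only; the authoritative schema for each question type is given by the examples in data/system_outputs.

```python
import json
import os
import tempfile

def write_system_output(predictions, path):
    """Serialize model predictions into a JSON list of records.

    The exact schema expected by ExplainaBoard depends on the task;
    the `predicted_answer` field name here is an assumption -- check
    the files in data/system_outputs for the authoritative format.
    """
    records = [{"predicted_answer": p} for p in predictions]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records

# Example: three multiple-choice predictions for the listening subdataset.
out_path = os.path.join(tempfile.gettempdir(), "rst_listening.json")
write_system_output(["A", "C", "B"], out_path)
```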

Method 1: Using ExplainaBoard SDK

Install ExplainaBoard

```bash
pip install --upgrade pip  # recommended: use the newest version of pip
pip install explainaboard
```

(1) Evaluate listening

```bash
explainaboard --task qa-multiple-choice --dataset gaokao2018_np1 --sub_dataset listening --system_outputs ./data/system_outputs/rst_2018_quanguojuan1_listening.json > report.json
```

(2) Evaluate cloze-multiple-choice

```bash
explainaboard --task cloze-multiple-choice --dataset gaokao2018_np1 --sub_dataset cloze-multiple-choice --system_outputs ./data/system_outputs/rst_2018_quanguojuan1_cloze_choice.json > report.json
```

(3) Evaluate cloze-hint

```bash
explainaboard --task cloze-generative --dataset gaokao2018_np1 --sub_dataset cloze-hint --system_outputs ./data/system_outputs/rst_2018_quanguojuan1_cloze_hint.json > report.json
```

(4) Evaluate reading-multiple-choice

```bash
explainaboard --task qa-multiple-choice --dataset gaokao2018_np1 --sub_dataset reading-multiple-choice --system_outputs ./data/system_outputs/rst_2018_quanguojuan1_reading_mc.json > report.json
```

(5) Evaluate reading-cloze

```bash
explainaboard --task cloze-multiple-choice --dataset gaokao2018_np1 --sub_dataset reading-cloze --system_outputs ./data/system_outputs/rst_2018_quanguojuan1_reading_dependent_cloze.json > report.json
```

(6) Evaluate writing-grammar

```bash
explainaboard --task grammatical-error-correction --dataset gaokao2018_np1 --sub_dataset writing-grammar --system_outputs ./data/system_outputs/rst_2018_quanguojuan1_gec.json > report.json
```

(7) Evaluate writing-essay

```bash
explainaboard --task conditional-generation --dataset gaokao2018_np1 --sub_dataset writing-essay --metrics bleu --system_outputs ./data/system_outputs/rst_writing_essay.tsv > report.json
```
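The seven invocations above differ only in the task, subdataset, and system-output file, so they can be generated by a small script. The sketch below is a dry run that only prints each command (drop the leading `echo`, and add the `> report.json` redirection, to actually execute them); the task/subdataset/file triples are taken verbatim from the commands listed.

```shell
#!/bin/sh
# Dry run: print the ExplainaBoard command for each question type.
run() {
  task=$1; sub=$2; file=$3; shift 3
  echo explainaboard --task "$task" --dataset gaokao2018_np1 \
    --sub_dataset "$sub" --system_outputs "./data/system_outputs/$file" "$@"
}

run qa-multiple-choice           listening               rst_2018_quanguojuan1_listening.json
run cloze-multiple-choice        cloze-multiple-choice   rst_2018_quanguojuan1_cloze_choice.json
run cloze-generative             cloze-hint              rst_2018_quanguojuan1_cloze_hint.json
run qa-multiple-choice           reading-multiple-choice rst_2018_quanguojuan1_reading_mc.json
run cloze-multiple-choice        reading-cloze           rst_2018_quanguojuan1_reading_dependent_cloze.json
run grammatical-error-correction writing-grammar         rst_2018_quanguojuan1_gec.json
run conditional-generation       writing-essay           rst_writing_essay.tsv --metrics bleu
```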

Note that we temporarily use BLEU as the evaluation metric here. If you would like your generated essays to be evaluated by humans (high-school teachers), you can send them to us.
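To make the automatic metric concrete, here is a simplified, single-reference sentence-level BLEU. It is illustrative only: production implementations (e.g., the one ExplainaBoard relies on) add smoothing and careful tokenization, so scores will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    """Simplified, unsmoothed sentence-level BLEU against one reference.

    BLEU = brevity_penalty * geometric mean of modified n-gram precisions.
    """
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: one empty n-gram order zeroes the score
        log_prec += math.log(overlap / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec / max_n)
```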

Method 2: Using ExplainaBoard Web Platform

An alternative way to evaluate your Gaokao system on each question type (e.g., listening) is the ExplainaBoard web platform.

For your convenience, we detail how to fill in the submission form for each question type in Gaokao:

(1) Evaluate listening

For the other question types (i.e., subdatasets), one can follow the same steps to fill in the form; the only differences are the Task and Metrics fields, which we detail below.

(2) Evaluate cloze-multiple-choice
(3) Evaluate cloze-hint
(4) Evaluate reading-multiple-choice
(5) Evaluate reading-cloze
(6) Evaluate writing-grammar
(7) Evaluate writing-essay

Note that the generated essays are currently evaluated with BLEU. If you would like your generated essays to be evaluated by humans (high-school teachers), you can send them to us.

How to submit your AI to Gaokao Benchmark

So far, the Gaokao Benchmark covers 70 subdatasets, so processing and uploading them individually is cumbersome. To facilitate evaluation, we provide an API that lets users upload all 70 system outputs at once.

Check out the detailed documentation.
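One way to prepare such a batch submission is to bundle every system-output file into a single archive before uploading. The sketch below does only the bundling step; the actual upload endpoint and expected archive layout are described in the documentation, so treat this structure as an assumption.

```python
import os
import zipfile

def bundle_system_outputs(output_dir, archive_path):
    """Bundle all system-output files under `output_dir` into one zip
    archive, so that every subdataset can be submitted in a single upload.

    The archive layout expected by the submission API is an assumption
    here -- consult the Gaokao Benchmark documentation for the real format.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(output_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Store paths relative to output_dir inside the archive.
                zf.write(full, arcname=os.path.relpath(full, output_dir))
    return archive_path
```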

Where to browse Gaokao leaderboard

Add more papers into Gaokao Benchmark