MMCU

This is the code repository for the paper "Measuring Massive Multitask Chinese Understanding" (https://arxiv.org/abs/2304.12986).

Download the dataset from https://huggingface.co/datasets/Besteasy/MMCU, <br> or email order@besteasy.com to request a free copy. <br> In your request, please state who you are (professor, college student, NLP researcher or engineer, etc.).<br> For academic exchanges, please contact me at felix.zeng@besteasy.com

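Important Notice

How to get the dataset: download the files directly from https://huggingface.co/datasets/Besteasy/MMCU.<br> You can also email order@besteasy.com to apply; just state your identity and intended use.<br> This benchmark only tests the semantic understanding ability of large models. It is not a comprehensive assessment of a model's capabilities, and the results are for reference only. The evaluation method, the evaluation dataset, and the evaluation records are all public so that the results can be reproduced.<br> The test set is provided free of charge to help researchers evaluate their own models and verify whether their training strategies work, not to produce a leaderboard. Improving Chinese large models remains a long road, and we hope everyone will make full and proper use of this dataset.<br>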

Updates

2023.8.11<br>

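Released CG-Eval, an automated benchmark for the generation capabilities of Chinese large models. It is the companion to MMCU: MMCU evaluates understanding, CG-Eval evaluates generation. See the paper "Evaluating the Generation Capabilities of Large Chinese Language Models", https://arxiv.org/abs/2308.04823<br> CG-Eval test dataset download: https://huggingface.co/datasets/Besteasy/CG-Eval<br> CG-Eval automated evaluation site: http://cgeval.besteasy.com/<br>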

2023.5.15<br>

All models were re-evaluated between 2023.5.13 and 2023.5.15. All results have been uploaded to the test_results folder and are publicly visible.<br>

2023.5.13<br>

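Researchers who obtained the dataset before 2023.5.13 need to add the answers to the following questions manually.<br> Psychology, question 306: the answer was missing; it should be B.<br> How does the amount of rapid eye movement (REM) sleep change as an individual ages? A. It keeps increasing B. It keeps decreasing C. It follows a U-shaped curve D. It follows an inverted U-shaped curve<br>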

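Infectious Diseases, question 126: the answer was missing; it should be A.<br> The best measure to prevent hepatitis B in newborns is: A. Vaccinate with recombinant hepatitis B vaccine immediately, within 24 hours of birth B. Inject hepatitis B immunoglobulin immediately after birth C. Inject gamma globulin as early as possible D. Inject human serum and placental globulin<br>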

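Infectious Diseases, question 155: the answer was missing; it should be C.<br> For a pregnant woman who develops severe viral hepatitis in early pregnancy, the correct management is: A. Treat the severe hepatitis actively and perform an induced abortion if the condition does not improve B. Perform an induced abortion immediately C. Treat the hepatitis and perform an induced abortion once the condition improves D. Treat the hepatitis and perform an induced abortion at the same time<br>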

2023.5.12<br>

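1. Revised the method for matching model-predicted answers so that predictions for multiple-answer questions are extracted more reliably.<br> 2. Corrected the special characters ABCD in the reference answers of some questions to the normal characters A B C D.<br> 3. Result files are now easier to read. Each line uses the form below: the first column is the model's predicted answer, the second column is the reference answer, and the third column records whether the prediction is correct.<br> ABD|||ABCD|||False<br> C|||BD|||False<br> ACD|||ABD|||False<br> BCD|||BCD|||True<br>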
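The three-column result format shown above (prediction, reference answer, and a True/False flag, separated by `|||`) can be read back with a few lines of Python. This is a minimal sketch, not code from the repository: the names `Record`, `parse_line`, and `accuracy` are hypothetical, and it assumes every non-empty line has exactly three fields.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    predicted: str   # letters the model chose, e.g. "ABD"
    reference: str   # reference answer letters, e.g. "ABCD"
    correct: bool    # flag stored in the third column


def parse_line(line: str) -> Record:
    """Split one 'prediction|||reference|||flag' line into its three fields."""
    predicted, reference, flag = line.strip().split("|||")
    return Record(predicted, reference, flag == "True")


def accuracy(lines: Iterable[str]) -> float:
    """Fraction of non-empty lines whose third column is True."""
    records = [parse_line(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    return sum(r.correct for r in records) / len(records)


# The four sample rows from the changelog entry above.
sample = [
    "ABD|||ABCD|||False",
    "C|||BD|||False",
    "ACD|||ABD|||False",
    "BCD|||BCD|||True",
]
print(accuracy(sample))  # 0.25
```

In the four sample rows, the flag is True only when the predicted letters equal the reference letters exactly, which suggests exact-match scoring for multiple-answer questions; the evaluation scripts remain the authority on how the flag is computed.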

Evaluation results (all models are versions from before May 15, 2023)

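Average scores across the four domains<br>

| zero-shot | bloomz_560m | bloomz_1b1 | bloomz_3b | bloomz_7b1_mt | ChatGLM 6B | MOSS 16B | GPT-3.5-turbo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Medicine (医疗) | 0.298 | 0.213 | 0.374 | 0.364 | 0.338 | 0.234 | 0.512 |
| Law (法律) | 0.163 | 0.14 | 0.18 | 0.174 | 0.169 | 0.133 | 0.239 |
| Psychology (心理学) | 0.201 | 0.187 | 0.319 | 0.346 | 0.288 | 0.211 | 0.447 |
| Education (教育) | 0.247 | 0.275 | 0.315 | 0.316 | 0.333 | 0.253 | 0.455 |
| Average (平均) | 0.227 | 0.204 | 0.297 | 0.300 | 0.282 | 0.208 | 0.413 |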
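The average row in the table above is consistent with an unweighted mean of the four domain scores (medicine, law, psychology, education). The sketch below checks that arithmetic for each model. The dictionary and function names are hypothetical, and that the authors computed the row this way is an assumption, which the numbers happen to support.

```python
# Per-model domain scores copied from the table, in the order
# medicine, law, psychology, education.
domain_scores = {
    "bloomz_560m": [0.298, 0.163, 0.201, 0.247],
    "bloomz_1b1": [0.213, 0.14, 0.187, 0.275],
    "bloomz_3b": [0.374, 0.18, 0.319, 0.315],
    "bloomz_7b1_mt": [0.364, 0.174, 0.346, 0.316],
    "ChatGLM 6B": [0.338, 0.169, 0.288, 0.333],
    "MOSS 16B": [0.234, 0.133, 0.211, 0.253],
    "GPT-3.5-turbo": [0.512, 0.239, 0.447, 0.455],
}

# Values reported in the average row of the table.
reported_average = {
    "bloomz_560m": 0.227,
    "bloomz_1b1": 0.204,
    "bloomz_3b": 0.297,
    "bloomz_7b1_mt": 0.300,
    "ChatGLM 6B": 0.282,
    "MOSS 16B": 0.208,
    "GPT-3.5-turbo": 0.413,
}


def macro_average(scores):
    """Unweighted mean: every domain counts equally, whatever its size."""
    return sum(scores) / len(scores)


# A reported value matches if it equals the mean rounded to three decimals;
# the small epsilon absorbs floating-point noise.
matches = {
    model: abs(macro_average(scores) - reported_average[model]) <= 0.0005 + 1e-9
    for model, scores in domain_scores.items()
}
print(all(matches.values()))
```

An unweighted mean gives a small domain the same influence as a large one, so it can differ from the accuracy computed over all questions pooled together.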

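Scores in the medical domain<br>

| zero-shot | bloomz_560m | bloomz_1b1 | bloomz_3b | bloomz_7b1_mt | ChatGLM 6B | MOSS 16B | GPT-3.5-turbo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Three Basics of Medicine (医学三基) | 0.311 | 0.274 | 0.375 | 0.415 | 0.371 | 0.231 | 0.552 |
| Pharmacology (药理学) | 0.265 | 0.235 | 0.38 | 0.36 | 0.255 | 0.285 | 0.52 |
| Nursing (护理学) | 0.33 | 0.278 | 0.372 | 0.368 | 0.339 | 0.238 | 0.516 |
| Pathology (病理学) | 0.312 | 0.267 | 0.392 | 0.341 | 0.358 | 0.278 | 0.506 |
| Clinical Medicine (临床医学) | 0.347 | 0.198 | 0.426 | 0.554 | 0.455 | 0.317 | 0.693 |
| Infectious Diseases (传染病学) | 0.295 | 0.242 | 0.398 | 0.46 | 0.401 | 0.254 | 0.587 |
| Surgery (外科学) | 0.365 | 0.251 | 0.397 | 0.374 | 0.361 | 0.279 | 0.525 |
| Anatomy (解剖学) | 0.182 | 0.136 | 0.227 | 0.227 | 0.273 | 0.136 | 0.5 |
| Medical Imaging (医学影像学) | 0.45 | 0.05 | 0.6 | 0.45 | 0.35 | 0.25 | 0.55 |
| Parasitology (寄生虫学) | 0.33 | 0.24 | 0.39 | 0.25 | 0.33 | 0.18 | 0.43 |
| Immunology (免疫学) | 0.282 | 0.147 | 0.331 | 0.344 | 0.319 | 0.178 | 0.515 |
| Pediatrics (儿科学) | 0.39 | 0.258 | 0.385 | 0.38 | 0.399 | 0.263 | 0.54 |
| Dermatology and Venereology (皮肤性病学) | 0.255 | 0.255 | 0.392 | 0.51 | 0.471 | 0.275 | 0.627 |
| Histology and Embryology (组织胚胎学) | 0.058 | 0.13 | 0.208 | 0.188 | 0.149 | 0.143 | 0.364 |
| Pharmaceutical Analysis (药物分析学) | 0.292 | 0.236 | 0.333 | 0.236 | 0.236 | 0.208 | 0.25 |
| Medical average (医疗平均分) | 0.298 | 0.213 | 0.374 | 0.364 | 0.338 | 0.234 | 0.512 |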

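Scores in the education domain<br>

| zero-shot | bloomz_560m | bloomz_1b1 | bloomz_3b | bloomz_7b1_mt | ChatGLM 6B | MOSS 16B | GPT-3.5-turbo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chinese (语文) | 0.233 | 0.283 | 0.248 | 0.205 | 0.256 | 0.233 | 0.31 |
| Mathematics (数学) | 0.251 | 0.257 | 0.281 | 0.325 | 0.307 | 0.257 | 0.427 |
| Physics (物理) | 0.173 | 0.208 | 0.185 | 0.202 | 0.256 | 0.208 | 0.327 |
| Chemistry (化学) | 0.28 | 0.34 | 0.28 | 0.14 | 0.3 | 0.28 | 0.44 |
| Politics (政治) | 0.239 | 0.255 | 0.329 | 0.401 | 0.329 | 0.268 | 0.545 |
| History (历史) | 0.279 | 0.296 | 0.421 | 0.432 | 0.448 | 0.245 | 0.513 |
| Geography (地理) | 0.255 | 0.271 | 0.336 | 0.411 | 0.346 | 0.284 | 0.478 |
| Biology (生物) | 0.262 | 0.287 | 0.443 | 0.414 | 0.422 | 0.245 | 0.599 |
| Average (平均) | 0.247 | 0.275 | 0.315 | 0.316 | 0.333 | 0.253 | 0.455 |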

Usage

--ntrain 0: zero-shot, no examples are provided<br> --ntrain 5: few-shot, five examples are provided<br>

zero-shot test for chatgpt

python TestChatGPT.py \
 --ntrain 0  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results

few-shot test for chatgpt

python TestChatGPT.py \
 --ntrain 5  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results

zero-shot test for bloomz

python TestBloomz.py \
 --ntrain 0  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results

few-shot test for bloomz

python TestBloomz.py \
 --ntrain 5  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results

zero-shot test for chatglm

python TestChatGLM.py \
 --ntrain 0  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results

few-shot test for chatglm

python TestChatGLM.py \
 --ntrain 5  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results

zero-shot test for MOSS

python TestMOSS.py \
 --ntrain 0  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results

few-shot test for MOSS

python TestMOSS.py \
 --ntrain 5  \
 --data_dir MMCU_dataset_path  \
 --save_dir path_for_test_results
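The eight commands above differ only in the script name and the `--ntrain` value, so a small driver can generate them all. This is a sketch rather than part of the repository: `build_commands` is a hypothetical helper, it assumes it is run from the directory that contains the four test scripts, and it only prints the commands instead of executing them.

```python
import itertools
import shlex

# Script names and flags as they appear in the usage examples above.
SCRIPTS = ["TestChatGPT.py", "TestBloomz.py", "TestChatGLM.py", "TestMOSS.py"]
SHOT_COUNTS = [0, 5]  # 0 = zero-shot, 5 = few-shot


def build_commands(data_dir, save_dir):
    """Return one argument list per (script, ntrain) combination."""
    commands = []
    for script, ntrain in itertools.product(SCRIPTS, SHOT_COUNTS):
        commands.append([
            "python", script,
            "--ntrain", str(ntrain),
            "--data_dir", data_dir,
            "--save_dir", save_dir,
        ])
    return commands


# Placeholder paths, as in the usage examples; replace with real ones.
for command in build_commands("MMCU_dataset_path", "path_for_test_results"):
    print(shlex.join(command))
```

To actually run the evaluations, each argument list can be passed to `subprocess.run(command, check=True)`. Keeping the commands as lists rather than shell strings avoids quoting problems when a path contains spaces.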

Citation

If you find the code and test set useful in your research, please consider citing:

@misc{zeng2023measuring,
      title={Measuring Massive Multitask Chinese Understanding},
      author={Hui Zeng},
      year={2023},
      eprint={2304.12986},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

The MMCU dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

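About the author: Head of LanguageX AI Lab // Author of MMCU, the evaluation benchmark for Chinese large models // WMT2022 general machine translation track: first place in English-to-Chinese automatic evaluation, third place in Chinese-to-English, third place in English-to-Japanese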