<p align="left"> <img src="https://raw.githubusercontent.com/fajri91/eval_picts/master/ArabicMMLU-Bar.png" style="width: 100%;" id="title-icon"> </p>

<p align="left"> <i>Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash, Preslav Nakov, and Timothy Baldwin</i> </p>

<h4 align="left"> MBZUAI, Prince Sattam bin Abdulaziz University, KFUPM, Core42, NYU Abu Dhabi, The University of Melbourne </h4>

## :fire: News
- [2024-02-21] The preprint of our paper is available on arXiv.
- [2024-06-16] ArabicMMLU has been accepted to the Findings of ACL 2024.
## Introduction
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA), and was carefully constructed in collaboration with native speakers in the region.
<p align="left"> <img src="https://github.com/fajri91/eval_picts/blob/master/ArabicMMLU-circle.png?raw=true" style="width: 45%;" id="title-icon"> </p>

## Data
Each question in the dataset is a multiple-choice question with up to five choices, exactly one of which is correct. The dataset can be accessed in the `data` folder of this repository and on Hugging Face:
```python
import datasets

# Load the full benchmark (all 40 tasks)
data = datasets.load_dataset('MBZUAI/ArabicMMLU', 'All')
```
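As a quick sanity check on a local copy, the sketch below prints one question. The split name and column names (`Question`, `Option 1`..`Option 5`, `Answer Key`, `Subject`) are assumptions here; verify them with `data['test'].column_names` on your copy.

```python
import datasets

data = datasets.load_dataset('MBZUAI/ArabicMMLU', 'All')

# Assumed split and column names; check data['test'].column_names first.
example = data['test'][0]

print(example['Subject'])              # school subject of the question
print(example['Question'])             # question text in MSA
for i in range(1, 6):                  # up to 5 choices; unused ones are empty
    choice = example.get(f'Option {i}')
    if choice:
        print(f'{i}) {choice}')
print('Gold answer:', example['Answer Key'])
```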
## Statistics
The data construction process involved a total of 10 Arabic native speakers from different countries: 6 internal workers (1 Jordanian, 1 Egyptian, 1 Lebanese, 1 from the UAE, and 2 from KSA) and 4 external workers (3 Jordanians and 1 Egyptian). The resulting corpus is sourced from eight countries, with Jordan, Egypt, and Palestine being the top three sources. We categorize the collected questions into different subject areas: (1) STEM (Science, Technology, Engineering, and Mathematics); (2) Social Science; (3) Humanities; (4) Arabic Language; and (5) Others.
<p align="left"> <img src="https://github.com/fajri91/eval_picts/blob/master/ArabicMMLU-country.png?raw=true" style="width: 45%;" id="title-icon"> </p>

## Examples
These questions are written in Arabic.
<p align="left"> <img src="https://github.com/fajri91/eval_picts/blob/master/ArabicMMLU-ex2.png?raw=true" style="width: 45%;" id="title-icon"> <img src="https://github.com/fajri91/eval_picts/blob/master/ArabicMMLU-ex1.png?raw=true" style="width: 45%;" id="title-icon"> </p>

## Evaluation
We evaluate 22 open-source multilingual models, 11 open-source Arabic-centric models, and 2 closed-source models. We experimented with different prompts in Arabic and English, and found that the English prompt works best. Below is an example of the input with the prompt.
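As a rough illustration of how such an English MCQ prompt can be assembled from a question row, here is a minimal sketch; the exact template used in the paper is the one shown in the figure below, and the column names follow the hypothetical ones from the snippet above.

```python
# A sketch of building an English multiple-choice prompt; not the exact
# template from the paper (see the figure below). Column names assumed.
LETTERS = ['A', 'B', 'C', 'D', 'E']

def build_prompt(example: dict) -> str:
    # Keep only the choices actually present (questions have up to 5).
    options = [example.get(f'Option {i}') for i in range(1, 6)]
    options = [o for o in options if o]
    lines = [f"Question: {example['Question']}"]
    for letter, option in zip(LETTERS, options):
        lines.append(f'{letter}. {option}')
    lines.append('Answer:')  # the model is expected to continue with a letter
    return '\n'.join(lines)
```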
<p align="left"> <img src="https://github.com/fajri91/eval_picts/blob/master/ArabicMMLU-prompt.png?raw=true" style="width: 30%;" id="title-icon"> </p>

### Zero-shot Evaluation
<p align="left"> <img src="https://github.com/fajri91/eval_picts/blob/master/ArabicMMLU-result.png?raw=true" style="width: 70%;" id="title-icon"> </p>

### Few-shot Evaluation
<p align="left"> <img src="https://github.com/fajri91/eval_picts/blob/master/ArabicMMLU-fewshot.png?raw=true" style="width: 45%;" id="title-icon"> </p>

## Evaluation Code
The code used to evaluate each model is in `evaluate.py`, and the commands to run it are listed in `run.sh`.
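For reference, scoring in this setup reduces to exact-match accuracy over the predicted answer letters. Below is a minimal sketch of that step; it is a simplification, not the logic of `evaluate.py` itself.

```python
# A minimal sketch of the scoring step: exact-match accuracy over
# answer letters (a simplification, not evaluate.py itself).
def accuracy(golds: list[str], preds: list[str]) -> float:
    assert len(golds) == len(preds), 'one prediction per question'
    hits = sum(g.strip().upper() == p.strip().upper()
               for g, p in zip(golds, preds))
    return hits / len(golds)

print(accuracy(['A', 'C', 'B'], ['a', 'B', 'B']))  # -> 0.666...
```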
## Citation
```bibtex
@inproceedings{koto2024arabicmmlu,
  title={ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic},
  author={Fajri Koto and Haonan Li and Sara Shatnawi and Jad Doughman and Abdelrahman Boda Sadallah and Aisha Alraeesi and Khalid Almubarak and Zaid Alyafeai and Neha Sengupta and Shady Shehata and Nizar Habash and Preslav Nakov and Timothy Baldwin},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}
```
## License
The ArabicMMLU dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.