
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

This repository contains the code for our paper AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages, which appeared at the Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) at EMNLP 2022.

Figure: Our self-active learning framework.

Model

Languages Covered

AfroLM has been pretrained from scratch on 23 African Languages: Amharic, Afan Oromo, Bambara, Ghomalá, Éwé, Fon, Hausa, Ìgbò, Kinyarwanda, Lingala, Luganda, Luo, Mooré, Chewa, Naija, Shona, Swahili, Setswana, Twi, Wolof, Xhosa, Yorùbá, and Zulu.

Evaluation Results

AfroLM was evaluated on the MasakhaNER1.0 (10 African languages) and MasakhaNER2.0 (21 African languages) datasets, as well as on text classification and sentiment analysis tasks. AfroLM outperformed AfriBERTa, mBERT, and XLMR-base, and was very competitive with AfroXLMR. AfroLM is also very data-efficient: it was pretrained on a dataset more than 14x smaller than those of its competitors. Below are the average F1 scores of the various models across datasets. Please consult our paper for per-language performance.

| Model | MasakhaNER | MasakhaNER2.0* | Text Classification (Yoruba/Hausa) | Sentiment Analysis (YOSM) | OOD Sentiment Analysis (Twitter -> YOSM) |
|---|---|---|---|---|---|
| AfroLM-Large | 80.13 | 83.26 | 82.90/91.00 | 85.40 | 68.70 |
| AfriBERTa | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 |
| mBERT | 71.55 | 80.68 | - | - | - |
| XLMR-base | 79.16 | 83.09 | - | - | - |
| AfroXLMR-base | 81.90 | 84.55 | - | - | - |

Pretrained Models and Dataset

Model: AfroLM-Large and Dataset: AfroLM Dataset

HuggingFace usage of AfroLM-large

```python
from transformers import XLMRobertaModel, XLMRobertaTokenizer

model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
```

The AutoTokenizer class does not load our tokenizer successfully, so we recommend using the XLMRobertaTokenizer class directly. Depending on your task, load the corresponding variant of the model. Read the XLMRoberta Documentation.
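For example, for a token-level task such as NER, the token-classification variant can be loaded instead of the base encoder. The sketch below is illustrative, not from our training scripts; the label set is a hypothetical MasakhaNER-style tag set.

```python
from transformers import XLMRobertaForTokenClassification, XLMRobertaTokenizer

# Hypothetical NER label set (MasakhaNER-style PER/ORG/LOC/DATE tags) for illustration.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

# Load AfroLM with a freshly initialized token-classification head on top.
model = XLMRobertaForTokenClassification.from_pretrained(
    "bonadossou/afrolm_active_learning",
    num_labels=len(labels),
)
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
```

The same pattern applies to sequence-level tasks via XLMRobertaForSequenceClassification; in all cases the pretrained encoder weights are reused and only the task head is newly initialized.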

Reproducing our results: Training and Evaluation

Citation

```bibtex
@inproceedings{dossou-etal-2022-afrolm,
    title = "{A}fro{LM}: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 {A}frican Languages",
    author = "Dossou, Bonaventure F. P. and Tonja, Atnafu Lambebo and Yousuf, Oreen and Osei, Salomey and Oppong, Abigail and Shode, Iyanuoluwa and Awoyomi, Oluwabusayo Olufunke and Emezue, Chris",
    booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sustainlp-1.11",
    pages = "52--64"
}
```

Reach out

Do you have a question? Please open an issue, and we will get back to you as soon as possible.