KINNEWS-and-KIRNEWS

Data, embeddings, stopword lists, code, and baselines for the COLING 2020 paper "KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi" by Rubungo Andre Niyongabo, Hong Qu, Julia Kreutzer, and Li Huang.

This paper introduces Kinyarwanda and Kirundi news classification datasets (KINNEWS and KIRNEWS, respectively), both collected from Rwanda and Burundi news websites and newspapers, for low-resource monolingual and cross-lingual multiclass classification tasks. Along with the datasets, we provide statistics, guidelines for preprocessing, pretrained word embeddings, and monolingual and cross-lingual baseline models.

Note: If you use any of the resources provided here, please remember to cite our paper.

Data

Download the datasets

Datasets description

Each dataset is in comma-separated values (CSV) format, with the columns described below. Note that the cleaned versions keep only the 'label', 'title', and 'content' columns.

| Field | Description |
|---|---|
| label | Numerical labels that range from 1 to 14 |
| en_label | English labels |
| kin_label | Kinyarwanda labels |
| kir_label | Kirundi labels |
| url | The link to the news source |
| title | The title of the news article |
| content | The full content of the news article |
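
As a minimal sketch of reading one of these files, the snippet below uses Python's standard csv module and keeps only the three columns found in the cleaned versions. The helper name `load_cleaned_rows` is hypothetical, the header row is assumed to use the field names listed above, and the sample row contains made-up placeholder values rather than real data.

```python
import csv
import io

# Columns kept in the cleaned versions, per the description above.
CLEANED_COLUMNS = ("label", "title", "content")


def load_cleaned_rows(csv_file):
    """Read a KINNEWS/KIRNEWS-style CSV and keep only the cleaned columns.

    `csv_file` is any open text file object. The header row is assumed
    to use the field names listed in the table above.
    """
    reader = csv.DictReader(csv_file)
    rows = []
    for record in reader:
        row = {name: record[name] for name in CLEANED_COLUMNS}
        row["label"] = int(row["label"])  # numerical label, 1 to 14
        rows.append(row)
    return rows


# Tiny in-memory example with placeholder values, standing in for a real file.
sample = io.StringIO(
    "label,en_label,kin_label,kir_label,url,title,content\n"
    '1,en,kin,kir,http://example.com/a,"A title","Some, quoted content"\n'
)
rows = load_cleaned_rows(sample)
print(rows[0]["label"], rows[0]["title"])
```

For a downloaded file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the dataset was saved.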

Word embeddings

Download pre-trained word embeddings

Training your own embeddings

To train your own word vectors, check out the code/embeddings/word2vec_training.py file or refer to the gensim documentation.

Stopwords

To use our stopwords, you can copy the whole stopset_kin (for Kinyarwanda) or stopset_kir (for Kirundi) into your code, or import them directly from the KKLTK package, which is the recommended approach.
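
Whichever way the stopword sets are obtained, applying them amounts to filtering tokens against a set. The sketch below is illustrative only: `remove_stopwords` is a hypothetical helper, lowercasing and whitespace tokenization are assumptions, and the three-entry `stopset_kin` and the example sentence are placeholders standing in for the full list and real text.

```python
def remove_stopwords(text, stopset):
    """Lowercase `text`, split on whitespace, and drop tokens in `stopset`."""
    return [token for token in text.lower().split() if token not in stopset]


# Placeholder only: replace with the full stopset_kin (or stopset_kir)
# copied from this repository. These three entries are illustrative.
stopset_kin = {"na", "mu", "ku"}

tokens = remove_stopwords("Abana na ababyeyi bari mu rugo", stopset_kin)
print(tokens)  # ['abana', 'ababyeyi', 'bari', 'rugo']
```

Using a Python set keeps each membership check constant-time on average, which matters when filtering full news articles.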

Leaderboard (baselines)

Monolingual

KINNEWS

| Model | Accuracy (%) |
|---|---|
| BiGRU(W2V-Kin-50*) | 88.65 |
| SVM(TF-IDF) | 88.53 |
| BiGRU(W2V-Kin-100) | 88.29 |
| CNN(W2V-Kin-50) | 87.55 |
| CNN(W2V-Kin-100) | 87.54 |
| LR(TF-IDF) | 87.14 |
| MNB(TF-IDF) | 82.70 |
| Char-CNN | 71.70 |

KIRNEWS

| Model | Accuracy (%) |
|---|---|
| SVM(TF-IDF) | 90.14 |
| CNN(W2V-Kin-100) | 88.01 |
| BiGRU(W2V-Kin-100) | 86.61 |
| LR(TF-IDF) | 86.13 |
| BiGRU(W2V-Kin-50) | 85.86 |
| CNN(W2V-Kin-50) | 85.75 |
| MNB(TF-IDF) | 82.67 |
| Char-CNN | 69.23 |

Cross-lingual

| Model | Train set | Test set | Accuracy (%) |
|---|---|---|---|
| MNB(TF-IDF) | KINNEWS | KIRNEWS | 73.46 |
| SVM(TF-IDF) | KINNEWS | KIRNEWS | 72.70 |
| LR(TF-IDF) | KINNEWS | KIRNEWS | 68.26 |
| BiGRU(W2V-Kin-50) | KINNEWS | KIRNEWS | 67.54 |
| BiGRU(W2V-Kin-100*) | KINNEWS | KIRNEWS | 65.06 |
| CNN(W2V-Kin-100) | KINNEWS | KIRNEWS | 61.72 |
| CNN(W2V-Kin-50) | KINNEWS | KIRNEWS | 60.64 |
| Char-CNN | KINNEWS | KIRNEWS | 49.60 |

| Model | Train set | Test set | Accuracy (%) |
|---|---|---|---|
| CNN(W2V-Kin-100) | KIRNEWS | KIRNEWS | 88.01 |
| BiGRU(W2V-Kin-100) | KIRNEWS | KIRNEWS | 86.61 |
| CNN(W2V-Kin-50) | KIRNEWS | KIRNEWS | 85.75 |
| BiGRU(W2V-Kin-50) | KIRNEWS | KIRNEWS | 83.38 |