KINNEWS-and-KIRNEWS
Data, embeddings, stopword lists, code, and baselines for the COLING 2020 paper "KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi" by Rubungo Andre Niyongabo, Hong Qu, Julia Kreutzer, and Li Huang.
This paper introduces Kinyarwanda and Kirundi news classification datasets (KINNEWS and KIRNEWS, respectively), both collected from news websites and newspapers in Rwanda and Burundi, for low-resource monolingual and cross-lingual multiclass classification tasks. Along with the datasets, we provide statistics, guidelines for preprocessing, pretrained word embeddings, and monolingual and cross-lingual baseline models.
Note: If you use any of the resources provided here, please cite our paper.
Data
Download the datasets
- The raw and cleaned versions of KINNEWS can be downloaded from here (21,268 articles, 14 classes, 45.07 MB raw and 38.85 MB cleaned)
- The raw and cleaned versions of KIRNEWS can be downloaded from here (4,612 articles, 12 classes, 9.31 MB raw and 7.77 MB cleaned)
Datasets description
Each dataset is in comma-separated values (CSV) format, with the columns described below. Note that the cleaned versions keep only the 'label', 'title', and 'content' columns:
Field | Description |
---|---|
label | Numerical labels that range from 1 to 14 |
en_label | English labels |
kin_label | Kinyarwanda labels |
kir_label | Kirundi labels |
url | The link to the news source |
title | The title of the news article |
content | The full content of the news article |
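As a rough illustration of the column layout above, the following sketch reads the three columns that the cleaned versions keep ('label', 'title', 'content') using only the Python standard library. The in-memory sample rows and the file path in the usage note are placeholders, not real data or real file names from this repository.

```python
import csv
import io
from collections import Counter


def load_articles(csv_file):
    """Read (label, title, content) tuples from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [(int(row["label"]), row["title"], row["content"]) for row in reader]


# Placeholder rows standing in for a cleaned dataset file (not real articles).
sample = io.StringIO(
    "label,title,content\n"
    '1,"Placeholder title A","Placeholder body, with a comma"\n'
    '2,"Placeholder title B","Another placeholder body"\n'
)

articles = load_articles(sample)

# Count how many articles fall under each numerical label.
label_counts = Counter(label for label, _, _ in articles)
print(label_counts)
```

To read a downloaded file instead, open it with something like `open("path/to/cleaned.csv", newline="", encoding="utf-8")` (a hypothetical path) and pass the file object to `load_articles`.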
Word embeddings
Download pre-trained word embeddings
- The Kinyarwanda embeddings can be downloaded from here (59.88MB for 100d and 29.94MB for 50d)
- The Kirundi embeddings can be downloaded from here (17.98MB for 100d and 8.96MB for 50d)
Training your own embeddings
To train your own word vectors, check out the code/embeddings/word2vec_training.py file or refer to the gensim documentation.
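For orientation, here is a minimal sketch of what word2vec training with gensim can look like. It is not the word2vec_training.py script from this repository: the tokenizer, the hyperparameters (`window`, `min_count`, `workers`), and the output file name are all assumptions chosen for illustration, and the two short strings stand in for a real corpus.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of ASCII letters and apostrophes (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def train_word2vec(sentences, dim=100):
    """Train word vectors on a list of token lists; requires gensim to be installed."""
    # Imported lazily so the tokenizer above can be used without gensim.
    from gensim.models import Word2Vec

    # `vector_size` is the gensim 4.x parameter name (gensim 3.x called it `size`).
    # The remaining values are illustrative, not the settings used in the paper.
    return Word2Vec(sentences=sentences, vector_size=dim, window=5, min_count=1, workers=2)


# Placeholder corpus; in practice this would be the 'content' column of a dataset.
corpus = [tokenize(t) for t in ["Amakuru y'u Rwanda", "Amakuru y'u Burundi"]]

# Uncomment to train and save (the output path is hypothetical):
# model = train_word2vec(corpus, dim=50)
# model.wv.save_word2vec_format("my_vectors_50d.txt")
```

The `dim` argument corresponds to the 50d and 100d variants mentioned above.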
Stopwords
To use our stopwords, you can either copy the whole stopset_kin (for Kinyarwanda) or stopset_kir (for Kirundi) into your code, or import them directly from the KKLTK package, which is the recommended option.
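The copy-into-your-code option could look roughly like the sketch below. The three entries in `stopset_kin` here are placeholders standing in for the real list, and `remove_stopwords` is a hypothetical helper, not part of KKLTK.

```python
# Placeholder entries only; paste the full stopset_kin from this repository here.
stopset_kin = {"na", "mu", "ku"}


def remove_stopwords(text, stopset):
    """Lowercase, split on whitespace, and drop tokens found in the stopword set."""
    return [token for token in text.lower().split() if token not in stopset]


tokens = remove_stopwords("amakuru mu Rwanda na Burundi", stopset_kin)
print(tokens)
```

The same function works for Kirundi by passing stopset_kir instead.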
Leaderboard (baselines)
Monolingual
KINNEWS
Model | Accuracy(%) |
---|---|
BiGRU(W2V-Kin-50*) | 88.65 |
SVM(TF-IDF) | 88.53 |
BiGRU(W2V-Kin-100) | 88.29 |
CNN(W2V-Kin-50) | 87.55 |
CNN(W2V-Kin-100) | 87.54 |
LR(TF-IDF) | 87.14 |
MNB(TF-IDF) | 82.70 |
Char-CNN | 71.70 |
KIRNEWS
Model | Accuracy(%) |
---|---|
SVM(TF-IDF) | 90.14 |
CNN(W2V-Kin-100) | 88.01 |
BiGRU(W2V-Kin-100) | 86.61 |
LR(TF-IDF) | 86.13 |
BiGRU(W2V-Kin-50) | 85.86 |
CNN(W2V-Kin-50) | 85.75 |
MNB(TF-IDF) | 82.67 |
Char-CNN | 69.23 |
Cross-lingual
Model | Train set | Test set | Accuracy(%) |
---|---|---|---|
MNB(TF-IDF) | KINNEWS | KIRNEWS | 73.46 |
SVM(TF-IDF) | KINNEWS | KIRNEWS | 72.70 |
LR(TF-IDF) | KINNEWS | KIRNEWS | 68.26 |
BiGRU(W2V-Kin-50) | KINNEWS | KIRNEWS | 67.54 |
BiGRU(W2V-Kin-100*) | KINNEWS | KIRNEWS | 65.06 |
CNN(W2V-Kin-100) | KINNEWS | KIRNEWS | 61.72 |
CNN(W2V-Kin-50) | KINNEWS | KIRNEWS | 60.64 |
Char-CNN | KINNEWS | KIRNEWS | 49.60 |
Model | Train set | Test set | Accuracy(%) |
---|---|---|---|
CNN(W2V-Kin-100) | KIRNEWS | KIRNEWS | 88.01 |
BiGRU(W2V-Kin-100) | KIRNEWS | KIRNEWS | 86.61 |
CNN(W2V-Kin-50) | KIRNEWS | KIRNEWS | 85.75 |
BiGRU(W2V-Kin-50) | KIRNEWS | KIRNEWS | 83.38 |
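To make the cross-lingual protocol concrete (fit on one language, report accuracy on the other), here is a small standard-library sketch. It is a simplification under stated assumptions: it uses a multinomial Naive Bayes classifier on raw term counts rather than TF-IDF, whitespace tokenization, and synthetic tokens instead of real articles, so it will not reproduce the numbers in the tables. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with Laplace smoothing on whitespace tokens."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "term_counts": term_counts,
            "vocab": vocab, "alpha": alpha}


def predict_mnb(model, text):
    """Return the label with the highest log posterior for one document."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["term_counts"][label]
        denom = sum(counts.values()) + model["alpha"] * vocab_size
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            # Words never seen in the training language carry no evidence.
            if token in model["vocab"]:
                score += math.log((counts[token] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    """Percentage of documents whose predicted label matches the gold label."""
    correct = sum(predict_mnb(model, d) == y for d, y in zip(docs, labels))
    return 100.0 * correct / len(labels)


# Synthetic stand-ins: "shared_*" tokens exist in both languages,
# "kin_*" only in the training language, "kir_*" only in the test language.
kin_docs = ["shared_ball kin_team", "shared_ball kin_match",
            "shared_money kin_bank", "shared_money kin_market"]
kin_labels = [1, 1, 2, 2]
kir_docs = ["shared_ball kir_team", "shared_money kir_bank", "kir_bank shared_ball"]
kir_labels = [1, 2, 2]

model = train_mnb(kin_docs, kin_labels)   # train on the KINNEWS-like set
acc = accuracy(model, kir_docs, kir_labels)  # evaluate on the KIRNEWS-like set
print(f"Cross-lingual accuracy: {acc:.2f}%")
```

In this toy setup only the shared vocabulary transfers between the two sets, which is why the third test document, whose only known word points to the wrong class, is misclassified.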