Lacuna Funded Project: MasakhaNER
The datasets developed by the project are:
- MAFAND-MT: Masakhane Anglo & Franco African News Corpus for Machine Translation
- MasakhaNER 2.0: Named Entity Recognition datasets for 20 African languages
- African POS datasets
Team & Partners
- Peter Nabende (Makerere University) - Principal Investigator
- David Ifeoluwa Adelani (Masakhane; Saarland University) - NER Coordinator
- Bamba Dione (Masakhane; University of Bergen) - POS Coordinator
- Jade Abbott (Masakhane; Retro Rabbit) - Data & Translation Coordinator
- Constantine Lignos (Masakhane; Brandeis University) - Quality Control
- Daniel D’souza (Masakhane) - Tool Management
- Sascha Heyer (IO Annotator) - Tool Development & Support
Language Coordinators
Language | Coordinator |
---|---|
Bambara | Allahsera Auguste Tapo |
Chichewa | Amelia Taylor |
Ewe | Godson Kalipe |
Fon | Bonaventure Dossou |
Ghomala | Koagne Victoire Memdjokam |
Hausa | Tajuddeen Gwadabe |
Igbo | Chris Emezue |
Kinyarwanda | Happy Buzaaba |
Luganda | Jonathan Mukiibi |
Luo | Perez Ogayo |
Moore | Fatoumata Kabore |
Nigerian-Pidgin | Aremu Anuoluwapo |
Setswana | Valencia Wagner |
Shona | Blessing Sibanda |
Swahili | Catherine Gitau |
Twi | Edwin Buabeng-Munkoh |
Wolof | Derguene Mbaye |
isiXhosa | Andiswa Bukula |
Yorùbá | Jesujoba Alabi |
isiZulu | Rooweither Mabuya |
Adding a corpus to the project
Preferably, each language has its own folder, named with the language's three-letter ISO 639-3 code, containing two files:
- The corpus, in a file named after the ISO 639-3 language code, e.g. xho.txt
- A readme file describing the number of articles, sentences, and tokens in the corpus. If possible, please specify the news categories of the articles, since we prefer a dataset that is balanced across categories.
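The layout above can be sketched in a few lines of Python. This is only an illustration, not a project tool: the helper names `corpus_stats` and `add_corpus`, the readme filename `README.md`, and the readme fields are all hypothetical. It also assumes one sentence per non-empty line and whitespace-separated tokens, which may not match how your corpus is actually segmented.

```python
from pathlib import Path


def corpus_stats(text: str) -> dict:
    """Count sentences and tokens.

    Assumption of this sketch: one sentence per non-empty line,
    and tokens are separated by whitespace.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    return {
        "sentences": len(lines),
        "tokens": sum(len(line.split()) for line in lines),
    }


def add_corpus(root: Path, iso_code: str, text: str) -> Path:
    """Create <root>/<iso_code>/ holding <iso_code>.txt and a readme."""
    # Only the shape of the code is checked (three lowercase ASCII letters),
    # not whether it is a registered ISO 639-3 code.
    if not (len(iso_code) == 3 and iso_code.isascii()
            and iso_code.isalpha() and iso_code.islower()):
        raise ValueError(f"expected a three-letter ISO 639-3 code, got {iso_code!r}")

    folder = root / iso_code
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"{iso_code}.txt").write_text(text, encoding="utf-8")

    stats = corpus_stats(text)
    # "README.md" and these fields are hypothetical; article counts and
    # news categories cannot be derived from raw text, so they stay blank.
    readme = (
        f"Language: {iso_code}\n"
        f"Sentences: {stats['sentences']}\n"
        f"Tokens: {stats['tokens']}\n"
        "Articles: (fill in)\n"
        "News categories: (fill in)\n"
    )
    (folder / "README.md").write_text(readme, encoding="utf-8")
    return folder
```

For example, `add_corpus(Path("."), "xho", text)` would create `xho/xho.txt` and `xho/README.md` in the current directory. The article count and news categories still need to be filled in by hand.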