The HornMT repository contains data and the associated metadata for the project Machine Translation Benchmark Dataset for Languages in the Horn of Africa. It is a multi-way parallel corpus that will serve as a benchmark to accelerate progress in machine translation research and production systems for languages in the Horn of Africa.
Supported Languages
Language | ISO 639-3 code |
Afar | aar |
Amharic | amh |
English | eng |
Oromo | orm |
Somali | som |
Tigrinya | tir |
The data consists of one text file per language. Each file contains the same news snippets in the same order, so the files are aligned with one another.
├── aar.txt
├── amh.txt
├── eng.txt
├── orm.txt
├── som.txt
└── tir.txt
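Because every file lists the snippets in the same order, the files can be read side by side to build translation pairs. The following is a minimal sketch, not part of the repository: it assumes each file holds exactly one snippet per line, and `load_parallel`, `pairs`, and the `data_dir` argument are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Language codes taken from the file names listed above.
LANGS = ["aar", "amh", "eng", "orm", "som", "tir"]


def load_parallel(data_dir, langs=LANGS):
    """Return one dict per snippet, mapping language code to text.

    Assumption: every file stores one snippet per line, so line i of
    each file is a translation of the same news item.
    """
    columns = {}
    for lang in langs:
        path = Path(data_dir) / f"{lang}.txt"
        with open(path, encoding="utf-8") as f:
            columns[lang] = [line.rstrip("\n") for line in f]

    # Refuse to pair lines if the files do not have the same length.
    lengths = {lang: len(lines) for lang, lines in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"files are not aligned: {lengths}")

    return [dict(zip(langs, row)) for row in zip(*(columns[l] for l in langs))]


def pairs(rows, src, tgt):
    """Extract (source, target) text pairs for one translation direction."""
    return [(row[src], row[tgt]) for row in rows]
```

For example, `pairs(load_parallel("data"), "eng", "tir")` would give English to Tigrinya pairs, provided the files live in a directory called `data`, which is also an assumption.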
The metadata is tab-separated data describing each news snippet. It contains the following fields.
- Scope - describes whether the news is global or local. It takes two values: Global news and Local news.
- Category - News category, one of the following 12 topics:
  - Art and Culture
  - Business and Economy
  - Conflicts and Attacks
  - Disaster and Accidents
  - Entertainment
  - Environment
  - Health
  - International Relations
  - Law and Crime
  - Politics
  - Science and Technology
  - Sport
- Source - List of one or more URLs from which the news content is extracted or on which it is based.
- Domain - TLD corresponding to the URL(s) in Source.
- Date - The publication date of the source article. The format is yyyy-mm-dd.
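The fields above can be read with Python's standard `csv` module. This is a minimal sketch under stated assumptions: that the file starts with a header row using exactly these field names (pass `has_header=False` if it does not), and that the columns appear in the order listed. `read_metadata` and `count_by` are hypothetical helpers, not part of the repository.

```python
import csv
from collections import Counter

# Field names as described in the list above. Whether the file carries a
# header row with these exact names is an assumption.
FIELDS = ["Scope", "Category", "Source", "Domain", "Date"]


def read_metadata(path, has_header=True):
    """Read the tab-separated metadata into a list of dicts, one per snippet."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(
            f,
            fieldnames=None if has_header else FIELDS,
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
        )
        return list(reader)


def count_by(rows, field):
    """Count how many snippets fall under each value of one field."""
    return Counter(row[field] for row in rows)
```

Calling `count_by(rows, "Category")` then gives the number of snippets per topic, which is a quick way to check the balance of the benchmark.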
Other Formats
The data and associated metadata are also available together in a single file, in the following formats.
- data and associated metadata in xlsx format.
- data and associated metadata in json format.
Below is an example row.
"eng":"The World Meteorological Organisation reports that the ozone layer is damaged to its worst extent ever in the Arctic.",
"aaf":"Baad Metrolojih Eglali Areketekeh Addal Ozonih qelu faxe waktik lafetle calat biyakisem xayose.",
"amh":"የአለም የአየር ንብረት ድርጅት በአርክቲክ አካባቢ ያለው የኦዞን ምንጣፍ ከፍተኛ ጉዳት እንደደረሰበት አስታወቀ፡፡",
"orm":"Dhaabbanni Meetiroolojii Addunyaa baqqaanni oozonii Arkiitik keessatti gara sadarkaa isa hamaa haga ammaatti akka miidhame gabaase.",
"som":"Ururka Saadaasha Hawada Adduunka ayaa ku warramaya in lakabka ozoneka ee Ka koreeya dhulka baraflayda uu waxyeelladii abid ugu darnaa soo gaadhay.",
"tir":"ውድብ ሜትሮሎጂ ዓለም ኣብ ኣርክቲክ ዝርከብ ናሕሲ ኦዞን ኣዝዩ ብዝኸፍአ ደረጃ ከምዝተጎድአ ሓቢሩ፡፡"
"category":"Science and Technology",
Translation Team
- Mohammed Deresa
- Yasin Nur
- Tigist Taye
- Selamawit Hailemariam
- Wako Tilahun
- Gemechis Melkamu
- Galata Girmaye
- Abdiselam Mohamed
- Beshir Abdi
- Berhanu Abadi Weldegiorgis
- Michael Minassie
- Nureddin Mohammedshiek
Project Leaders
- Asmelash Teka Hadgu
- Gebrekirstos G. Gebremeskel
- Abel Aregawi
This work is licensed under a Creative Commons Attribution 4.0 International License.