Ukuxhumana

"Ukuxhumana" means "communicate" in Zulu. This project explores ideas for applying Neural Machine Translation (NMT) to low-resource languages, currently focusing on the official languages of South Africa. We are looking for collaborators across the continent to work with us on other languages.

Mission

Data

Parallel Corpora

Our parallel corpora come from the Autshumato project. The datasets contain data translated by professional translators, translated file pairs sourced from translators, and data obtained from government websites and documents. We also performed extra cleaning on the corpora, which is described here.
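
Typical cleaning steps for parallel data can be sketched as follows (a minimal illustration; the function name and filter thresholds are hypothetical, and the actual cleaning pipeline is the one described in the linked notes):

```python
def clean_parallel(src_lines, tgt_lines, max_ratio=2.5, max_len=100):
    """Deduplicate sentence pairs and drop empty, overlong, or badly
    length-mismatched pairs (hypothetical thresholds for illustration)."""
    seen = set()
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue                      # drop pairs with an empty side
        pair = (src, tgt)
        if pair in seen:
            continue                      # drop exact duplicate pairs
        ns, nt = len(src.split()), len(tgt.split())
        if ns > max_len or nt > max_len:
            continue                      # drop overly long sentences
        if max(ns, nt) / min(ns, nt) > max_ratio:
            continue                      # drop likely misaligned pairs
        seen.add(pair)
        cleaned.append(pair)
    return cleaned
```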

Monolingual Corpora

Our monolingual corpora come from a variety of sources. We used them to train fastText embeddings, which are in turn used for unsupervised NMT.

Zulu

English

Known Corpora

We keep a list of known corpora for African languages here. Please consider contributing a link to your corpus :)

Models

Currently, two main architectures are used throughout this project: Convolutional Sequence to Sequence (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017), implemented with Fairseq(-py) and Tensor2Tensor respectively. For the convolutional models, a model was trained for each language using byte-pair encoding (BPE) for tokenisation, with the learning rate set to 0.25 and dropout to 0.2. Beam search with a width of 5 was used to decode the test data.
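
BPE learns its subword vocabulary by repeatedly merging the most frequent adjacent symbol pair. A minimal sketch of that merge-learning loop (illustrative only; the actual models rely on the BPE tooling used with Fairseq and Tensor2Tensor):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a frequency dict mapping each word
    (as a tuple of symbols) to its corpus frequency."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges
```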

For the Transformer models, the original Tensor2Tensor implementation was used, with the learning rate set to 0.4, a batch size of 1024, and 45,000 learning-rate warm-up steps. Tokenisation was done using WordPiece, and beam search with a width of 4 was used for decoding.

Results

Results are given in BLEU.
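
BLEU combines modified n-gram precisions with a brevity penalty. A simplified sentence-level version for reference (illustrative only; the scores below would come from standard evaluation tooling run over the whole test set):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n n-gram precisions,
    scaled by a brevity penalty for short candidates."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0                         # zero precision zeroes BLEU here
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)
```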

Baseline

English -> Language

| Model | Setswana | isiZulu* | Northern Sotho | Xitsonga | Afrikaans |
| --- | --- | --- | --- | --- | --- |
| Google Translate | | 7.55 | | | 41.18 |
| Convolutional Seq2Seq (clean) | 24.18 | 0.28 | 7.41 | 36.96 | 16.17 |
| Convolutional Seq2Seq (best BPE) | 26.36 (40k) | 1.79 (4k) | 12.18 (4k) | 37.45 (20k) | 25.04 (4k) |
| Transformer (uncased) | 33.53 | 3.33 | 24.16 (4k) | 49.74 (20k) | 35.26 (4k) |
| Transformer (cased) | 33.12 | 3.16 (4k) | 23.77 (4k) | 49.30 (20k) | 34.81 (4k) |
| Unsupervised MT (60k BPE) | | 4.45 | | | |

* The isiZulu data requires cleaning: translations often contain more information than the original sentence, leading to poor BLEU scores.

Autshumato Machine Translation Benchmark

| Model | Afrikaans | isiZulu | Northern Sotho | Setswana | Xitsonga |
| --- | --- | --- | --- | --- | --- |
| Convolutional Seq2Seq | 12.30 | 0.52 | 7.41 | 10.31 | 10.73 |
| Transformer | 20.60 | 1.34 | 10.94 | 15.60 | 17.98 |

Publications & Citations

Benchmarking Neural Machine Translation for Southern African Languages

A Focus on Neural Machine Translation for African Languages

Towards Neural Machine Translation for African Languages