Awesome New Languages in Machine Translation
This is a list of initiatives for adding new languages to open-source machine translation models (such as NLLB).
Notable projects that improve translation quality for an already supported low-resource language are also highlighted.
The first part of the document lists individual languages in alphabetical order of their English names.
The second part of the document lists multilingual initiatives.
Additions are welcome (as pull requests or issues)!
Single-language projects
Ainu
Amis
Aromanian
- Code and description: https://github.com/lolismek/AroTranslate
- Paper: https://arxiv.org/abs/2410.17728
- Interface: https://arotranslate.com/
Awajun
Bambara
Buryat
- Press: https://burunen.ru/news/society/107048 (in Russian)
- Interface: https://translate-bur.ru/
- Model: https://huggingface.co/SaranaAbidueva/nllb-200-bxr-ru
Circassian (Kabardian)
- Interface: https://www.zedzek.com/en
Erzya
- Interface: https://lango.to/
- Paper (for an old version): https://aclanthology.org/2022.fieldmatters-1.6/
Additionally, see TartuNLP.
Fula
Hill Mari
See TartuNLP
Interslavic
- Model: https://huggingface.co/Salavat/nllb-200-distilled-600M-finetuned-isv_v2
- Demo: https://huggingface.co/spaces/Salavat/Interslavic-Translator-NLLB200
- Presentation: https://www.youtube.com/watch?v=BiNrza83Gvw
Karakalpak
- Interface: https://tahrirchi.uz/uz/translator
Komi
See TartuNLP
Lezgian
Codes: lez (ISO 639-3), lezg1247 (Glottolog)
- Model: https://huggingface.co/leks-forever/nllb-200-distilled-600M
- Code: https://github.com/leks-forever/nllb-tuning
- Demo: https://huggingface.co/spaces/leks-forever/lezghian-nllb-200-distilled-600M
- Description (in Russian): in a Telegram channel
Livonian
See TartuNLP
Livvi Karelian
See TartuNLP
Mansi
Mari (Meadow)
See TartuNLP
Moksha
See TartuNLP
Ngambay
Qarachay Malqar
- GitHub: https://github.com/TBSj/Qarachay_Malqar_translator
- Model: https://huggingface.co/TSjB/NLLB-201-600M-QM-V1
- Blog post (in Russian): https://habr.com/ru/articles/829248/
Tyvan
- Blog: https://cointegrated.medium.com/a37fc706b865
- Interface: https://tyvan.ru/
Udmurt
See TartuNLP
Zarma
Multilingual projects
Finno-Ugric languages (TartuNLP)
Multiple Finno-Ugric languages (including Komi, Udmurt, Hill and Meadow Mari, Erzya, Livonian, Mansi, Moksha and Livvi Karelian)
- Paper (an early one): https://aclanthology.org/2022.wmt-1.33/
- Paper: https://aclanthology.org/2023.nodalida-1.77.pdf
- Interface: https://translate.ut.ee/
- Model: https://huggingface.co/tartuNLP/smugri3_14-finno-ugric-nmt
Indigenous languages of the Americas (AmericasNLP Shared Tasks)
Indigenous languages of the Americas (including Ashaninka, Aymara, Bribri, Chatino, Guarani, Hñähñu, Nahuatl, Quechua, Raramuri, Shipibo-Konibo, and Wixarika from the AmericasNLP MT shared task, as well as Wayuunaiki, Arhuaco, Inga, and Nasa)
- Paper: https://aclanthology.org/2023.americasnlp-1.19.pdf
- Paper: https://aclanthology.org/2024.americasnlp-1.22.pdf
- Paper: https://aclanthology.org/2024.americasnlp-1.2.pdf
Hundreds of diverse languages (Apertium)
Apertium is a rule-based machine translation system.
It currently has linguistic tools (such as dictionaries and morphological parsers) for a very large number of languages, but only a few of them (51 language pairs) have been developed to a state considered stable enough for a public translation service.
- Code: https://github.com/apertium
- Interface (with only a subset of the most stable language pairs): https://www.apertium.org/