sugali

This is the legacy repository of a language identification project covering many (many) languages, developed for the software project course "NLP projects for low-resource languages".

The final technical report (poster) is available at http://www.coli.uni-saarland.de/courses/cl4lrl-swp/data/SugaliPoster.pdf

Description

Given a string of text in an arbitrary language, can we train a system to recognize which language the text is written in? The project uses four sources of data: the Universal Declaration of Human Rights, Wikipedia, ODIN, and some portions of the data available from Omniglot. The resulting system covers well over 1000 languages.
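
To make the task concrete, here is a minimal sketch of one common baseline for language identification: character trigram profiles compared by cosine similarity. This is an illustrative assumption, not necessarily the method used in sugali. The function names (`char_ngrams`, `train`, `identify`) and the `TOY_SAMPLES` data are hypothetical; the three training sentences are Article 1 of the Universal Declaration of Human Rights in English, German, and French.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one n-gram profile per language from {language_code: [texts]}."""
    profiles = {}
    for lang, texts in samples.items():
        profile = Counter()
        for text in texts:
            profile.update(char_ngrams(text))
        profiles[lang] = profile
    return profiles


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data: UDHR Article 1 in three languages.
TOY_SAMPLES = {
    "eng": ["All human beings are born free and equal in dignity and rights."],
    "deu": ["Alle Menschen sind frei und gleich an Würde und Rechten geboren."],
    "fra": ["Tous les êtres humains naissent libres et égaux en dignité et en droits."],
}

if __name__ == "__main__":
    profiles = train(TOY_SAMPLES)
    print(identify("They are endowed with reason and conscience", profiles))
    print(identify("Sie sind mit Vernunft und Gewissen begabt", profiles))
```

With a single sentence per language this is only a toy; a system covering over 1000 languages would need far more data per language and probably a more robust model.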

As a spin-off, we've also produced the SeedLing corpus, with data from over 1000 languages. The corpus is freely available in the SeedLing GitHub repository. The reference paper for the corpus is at https://www.aclweb.org/anthology/W14-2211/

Credits

Cite

If you need to refer to the poster or the code, feel free to cite:

@misc{sugali,
  author = {Susanne Fertmann and Guy Emerson and Liling Tan},
  title = {Language Identification for Low-Resource Languages},
  year = {2014},
  url = {https://github.com/alvations/sugali/},
  institution = {Saarland University, Germany},
  note = {Technical Report for NLP projects for low-resource languages. Saarland, Germany}
}