Awesome
Meta info
Objective
Create a crowdsourced repository of NLP resources for the native languages of Sri Lanka.
Target languages
Sinhala and Tamil
What to add
- Datasets
- Research publications
- Open source implementations
- Useful articles, blog posts
- Other linguistic resources (e.g. Sinhala/Tamil phonology)
- Tools (or links to proposals) for crowdsourced data collection
How to contribute
- Send a pull request to lklangs.github.io with your changes or send an email to lklangs2019@gmail.com
- Use the Browserling tool to generate HTML from README.md.
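If an online converter is not convenient, the HTML step can also be approximated locally. The sketch below is not the Browserling tool and not a full Markdown implementation: it is a minimal, standard-library-only illustration that assumes the README uses only `#` headings, flat `-` lists, and plain paragraphs. The function name `markdown_to_html` is hypothetical, and links, nested lists, and inline formatting are deliberately not handled.

```python
import html
import re


def markdown_to_html(text: str) -> str:
    """Convert a tiny Markdown subset (headings, flat '-' lists, paragraphs) to HTML."""
    out = []
    in_list = False
    for raw in text.splitlines():
        line = raw.strip()
        heading = re.match(r"^(#{1,6})\s+(.*)$", line)
        is_item = line.startswith("- ")
        # Close an open list as soon as a non-item line (including a blank one) appears.
        if in_list and not is_item:
            out.append("</ul>")
            in_list = False
        if not line:
            continue
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{html.escape(heading.group(2))}</h{level}>")
        elif is_item:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"<li>{html.escape(line[2:].strip())}</li>")
        else:
            out.append(f"<p>{html.escape(line)}</p>")
    # Close a list that runs to the end of the input.
    if in_list:
        out.append("</ul>")
    return "\n".join(out)
```

One possible use is to read `README.md` as UTF-8 text, pass it to `markdown_to_html`, and write the result into the page body. For anything beyond this subset, a complete Markdown converter is the better choice.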
To do
- Resource templates
- Structuring
- Update website
Resources
Publications
- TICO-19: the Translation Initiative for COvid-19 2020
- The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English 2019
- Survey on Publicly Available Sinhala Natural Language Processing Tools and Research 2019
- Natural Language Processing for Government: Problems and Potential 2019
- Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English
- Comparison Between Performance of Various Database Systems for Implementing a Language Corpus 2015
- Implementing a Corpus for Sinhala Language 2015
- Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language 2015
- Sinhala Handwriting Recognition Mechanism Using Zone Based Feature Extraction 2015
- Sinhala-Tamil Machine Translation: Towards better Translation Quality 2014
- Building a WordNet for Sinhala 2014
- Corpus-based Sinhala Lexicon 2009
Datasets
- Language Technology Research Lab, University of Colombo School of Computing
- Google Dakshina Dataset for 12 South Asian languages, including Sinhala and Tamil
- Databricks Dolly 15k Sinhala Dataset: a machine-translated version of the Databricks Dolly 15k dataset.
- Sinhala Question Answering Dataset 1k: a Sinhala QA dataset with English translations.
- Sinhala Wikipedia 202306: Sinhala Wikipedia from the June 2023 dump, in Hugging Face datasets format.
- fastText: built by Facebook using Wikipedia.
- fastText: built by CSE, UoM using Wikipedia, news, and official government documents.
- Sinhala-English parallel corpora
  - Subtitle pairs (600k+)
  - Sentence pairs (45k+): GNOME, KDE, and Ubuntu
Tools
- Machine Translation System for Sinhala-Tamil Language Pair
- MIDAS-NMT-English-Tamil
- Sentiment Analysis of Sinhala News Comments
- National Languages Processing Centre (UoM) on GitHub
- Sinhala Sentence Similarity Measurement
- Tamil Emotion Tweet Scraper
- Morphological analyser and tokenizer for Sinhala nouns (SinLing)
- Sinhala-and-Tamil-NER
- Tamil-Tokeniser
- Sinhala-Tokeniser
- Sinhala skipgram model
- Sinhala TTS Recipe
- Sinhala ASR Recipe