Natural Language Toolkit for Indic Languages (iNLTK)

iNLTK aims to provide out-of-the-box support for the NLP tasks that an application developer might need for Indic languages. The paper describing the iNLTK library was accepted at the NLP-OSS workshop at EMNLP 2020; the full reference is in the Citation section below.

Documentation

Check out the detailed docs, along with installation instructions, at https://inltk.readthedocs.io

Supported languages

Native languages

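| Language | Code `<code-of-language>` |
| --- | --- |
| Hindi | hi |
| Punjabi | pa |
| Gujarati | gu |
| Kannada | kn |
| Malayalam | ml |
| Oriya | or |
| Marathi | mr |
| Bengali | bn |
| Tamil | ta |
| Urdu | ur |
| Nepali | ne |
| Sanskrit | sa |
| English | en |
| Telugu | te |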

Code Mixed languages

| Language | Script | Code `<code-of-language>` |
| --- | --- | --- |
| Hinglish (Hindi+English) | Latin | hi-en |
| Tanglish (Tamil+English) | Latin | ta-en |
| Manglish (Malayalam+English) | Latin | ml-en |
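
The language codes above are the values the documentation refers to as `<code-of-language>`. As a minimal sketch, the lookup below mirrors the two tables so that a human-readable language name can be turned into its code before it is passed to the library. The helper name `resolve_language_code` and the two dictionaries are assumptions made for illustration; they are not part of iNLTK.

```python
# Language codes copied from the two tables above.
NATIVE_LANGUAGES = {
    "Hindi": "hi", "Punjabi": "pa", "Gujarati": "gu", "Kannada": "kn",
    "Malayalam": "ml", "Oriya": "or", "Marathi": "mr", "Bengali": "bn",
    "Tamil": "ta", "Urdu": "ur", "Nepali": "ne", "Sanskrit": "sa",
    "English": "en", "Telugu": "te",
}
CODE_MIXED_LANGUAGES = {
    "Hinglish": "hi-en", "Tanglish": "ta-en", "Manglish": "ml-en",
}


def resolve_language_code(name: str) -> str:
    """Return the code for a language name.

    Hypothetical helper for illustration only; it is not an iNLTK function.
    """
    key = name.strip().capitalize()
    for table in (NATIVE_LANGUAGES, CODE_MIXED_LANGUAGES):
        if key in table:
            return table[key]
    raise ValueError(f"Unsupported language: {name!r}")


if __name__ == "__main__":
    print(resolve_language_code("tamil"))     # ta
    print(resolve_language_code("Hinglish"))  # hi-en
```

The returned string is what you would substitute wherever the docs at https://inltk.readthedocs.io show `<code-of-language>`.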

Repositories containing models used in iNLTK

| Language | Repository | Dataset used for Language modeling | Perplexity of ULMFiT LM<br>(on validation set) | Perplexity of TransformerXL LM<br>(on validation set) | Dataset used for Classification | Classification:<br>Test set Accuracy | Classification:<br>Test set MCC | Classification: Notebook<br>for Reproducibility | ULMFiT Embeddings visualization | TransformerXL Embeddings visualization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hindi | NLP for Hindi | Hindi Wikipedia Articles - 172k<br>Hindi Wikipedia Articles - 55k | 34.06<br>35.87 | 26.09<br>34.78 | BBC News Articles<br>IIT Patna Movie Reviews<br>IIT Patna Product Reviews | 78.75<br>57.74<br>75.71 | 0.71<br>0.37<br>0.59 | Notebook<br>Notebook<br>Notebook | Hindi Embeddings projection | Hindi Embeddings projection |
| Bengali | NLP for Bengali | Bengali Wikipedia Articles | 41.2 | 39.3 | Bengali News Articles (Soham Articles) | 90.71 | 0.87 | Notebook | Bengali Embeddings projection | Bengali Embeddings projection |
| Gujarati | NLP for Gujarati | Gujarati Wikipedia Articles | 34.12 | 28.12 | iNLTK Headlines Corpus - Gujarati | 91.05 | 0.86 | Notebook | Gujarati Embeddings projection | Gujarati Embeddings projection |
| Malayalam | NLP for Malayalam | Malayalam Wikipedia Articles | 26.39 | 25.79 | iNLTK Headlines Corpus - Malayalam | 95.56 | 0.93 | Notebook | Malayalam Embeddings projection | Malayalam Embeddings projection |
| Marathi | NLP for Marathi | Marathi Wikipedia Articles | 18 | 17.42 | iNLTK Headlines Corpus - Marathi | 92.40 | 0.85 | Notebook | Marathi Embeddings projection | Marathi Embeddings projection |
| Tamil | NLP for Tamil | Tamil Wikipedia Articles | 19.80 | 17.22 | iNLTK Headlines Corpus - Tamil | 95.22 | 0.92 | Notebook | Tamil Embeddings projection | Tamil Embeddings projection |
| Punjabi | NLP for Punjabi | Punjabi Wikipedia Articles | 24.40 | 14.03 | IndicNLP News Article Classification Dataset - Punjabi | 97.12 | 0.96 | Notebook | Punjabi Embeddings projection | Punjabi Embeddings projection |
| Kannada | NLP for Kannada | Kannada Wikipedia Articles | 70.10 | 61.97 | IndicNLP News Article Classification Dataset - Kannada | 98.87 | 0.98 | Notebook | Kannada Embeddings projection | Kannada Embeddings projection |
| Oriya | NLP for Oriya | Oriya Wikipedia Articles | 26.57 | 26.81 | IndicNLP News Article Classification Dataset - Oriya | 98.83 | 0.98 | Notebook | Oriya Embeddings Projection | Oriya Embeddings Projection |
| Sanskrit | NLP for Sanskrit | Sanskrit Wikipedia Articles | ~6 | ~3 | Sanskrit Shlokas Dataset | 84.3 (valid set) | | | Sanskrit Embeddings projection | Sanskrit Embeddings projection |
| Nepali | NLP for Nepali | Nepali Wikipedia Articles | 31.5 | 29.3 | Nepali News Dataset | 98.5 (valid set) | | | Nepali Embeddings projection | Nepali Embeddings projection |
| Urdu | NLP for Urdu | Urdu Wikipedia Articles | 13.19 | 12.55 | Urdu News Dataset | 95.28 (valid set) | | | Urdu Embeddings projection | Urdu Embeddings projection |
| Telugu | NLP for Telugu | Telugu Wikipedia Articles | 27.47 | 29.44 | Telugu News Dataset<br>Telugu News Andhra Jyoti | 95.4<br>92.09 | | Notebook<br>Notebook | Telugu Embeddings projection | Telugu Embeddings projection |
| Tanglish | NLP for Tanglish | Synthetic Tanglish Dataset | 37.50 | - | Dravidian Codemix HASOC @ FIRE 2020<br>Dravidian Codemix Sentiment Analysis @ FIRE 2020 | F1 Score: 0.88<br>F1 Score: 0.62 | - | Notebook<br>Notebook | Tanglish Embeddings Projection | - |
| Manglish | NLP for Manglish | Synthetic Manglish Dataset | 45.84 | - | Dravidian Codemix HASOC @ FIRE 2020<br>Dravidian Codemix Sentiment Analysis @ FIRE 2020 | F1 Score: 0.74<br>F1 Score: 0.69 | - | Notebook<br>Notebook | Manglish Embeddings Projection | - |
| Hinglish | NLP for Hinglish | Synthetic Hinglish Dataset | 86.48 | - | - | - | - | - | Hinglish Embeddings Projection | - |

Note: The English model has been taken directly from fast.ai.
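
For readers comparing the perplexity columns, here is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities: it is the exponential of the mean negative log-likelihood. This assumes the reported numbers follow that standard definition with natural logarithms; the function name `perplexity` is illustrative, and the per-language repositories remain the authoritative source for how each value was measured.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood).

    `token_log_probs` holds the natural-log probability the model assigned
    to each actual next token in the validation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that is uniformly unsure between 4 tokens has perplexity 4.
    uniform = [math.log(0.25)] * 10
    print(round(perplexity(uniform), 6))  # 4.0
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.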

Effect of using Transfer Learning + Paraphrases from iNLTK

| Language | Repository | Dataset used for Classification | Results on using<br>complete training set | Percentage Decrease<br>in Training set size | Results on using<br>reduced training set<br>without Paraphrases | Results on using<br>reduced training set<br>with Paraphrases |
| --- | --- | --- | --- | --- | --- | --- |
| Hindi | NLP for Hindi | IIT Patna Movie Reviews | Accuracy: 57.74<br>MCC: 37.23 | 80% (2480 -> 496) | Accuracy: 47.74<br>MCC: 20.50 | Accuracy: 56.13<br>MCC: 34.39 |
| Bengali | NLP for Bengali | Bengali News Articles (Soham Articles) | Accuracy: 90.71<br>MCC: 87.92 | 99% (11284 -> 112) | Accuracy: 69.88<br>MCC: 61.56 | Accuracy: 74.06<br>MCC: 65.08 |
| Gujarati | NLP for Gujarati | iNLTK Headlines Corpus - Gujarati | Accuracy: 91.05<br>MCC: 86.09 | 90% (5269 -> 526) | Accuracy: 80.88<br>MCC: 70.18 | Accuracy: 81.03<br>MCC: 70.44 |
| Malayalam | NLP for Malayalam | iNLTK Headlines Corpus - Malayalam | Accuracy: 95.56<br>MCC: 93.29 | 90% (5036 -> 503) | Accuracy: 82.38<br>MCC: 73.47 | Accuracy: 84.29<br>MCC: 76.36 |
| Marathi | NLP for Marathi | iNLTK Headlines Corpus - Marathi | Accuracy: 92.40<br>MCC: 85.23 | 95% (9672 -> 483) | Accuracy: 84.13<br>MCC: 68.59 | Accuracy: 84.55<br>MCC: 69.11 |
| Tamil | NLP for Tamil | iNLTK Headlines Corpus - Tamil | Accuracy: 95.22<br>MCC: 92.70 | 95% (5346 -> 267) | Accuracy: 86.25<br>MCC: 79.42 | Accuracy: 89.84<br>MCC: 84.63 |

For more details on the implementation, or to reproduce the results, check out the respective repositories.
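
The arithmetic behind the table can be checked with a short sketch. It recomputes the percentage decrease in training set size from the two sizes given in each row, and derives two quantities that are not printed in the table: the accuracy gained by adding paraphrases, and the share of full-training-set accuracy retained. The numbers are copied from the table; the names `ROWS`, `percent_decrease` and `summarize` are illustrative assumptions, not part of iNLTK.

```python
# (language, full train size, reduced train size,
#  accuracy on full set, accuracy on reduced set, accuracy on reduced set + paraphrases)
ROWS = [
    ("Hindi", 2480, 496, 57.74, 47.74, 56.13),
    ("Bengali", 11284, 112, 90.71, 69.88, 74.06),
    ("Gujarati", 5269, 526, 91.05, 80.88, 81.03),
    ("Malayalam", 5036, 503, 95.56, 82.38, 84.29),
    ("Marathi", 9672, 483, 92.40, 84.13, 84.55),
    ("Tamil", 5346, 267, 95.22, 86.25, 89.84),
]


def percent_decrease(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def summarize(row):
    """Derive the comparison figures for one table row."""
    lang, full, reduced, acc_full, acc_plain, acc_para = row
    return {
        "language": lang,
        "decrease_pct": percent_decrease(full, reduced),
        "gain_from_paraphrases": acc_para - acc_plain,
        "retained_pct": 100.0 * acc_para / acc_full,
    }


if __name__ == "__main__":
    for row in ROWS:
        s = summarize(row)
        print(
            f"{s['language']:<10} decrease={s['decrease_pct']:.1f}% "
            f"gain={s['gain_from_paraphrases']:+.2f} "
            f"retained={s['retained_pct']:.1f}%"
        )
```

In every row the paraphrase-augmented accuracy is higher than the accuracy on the reduced set alone, although the size of the gain varies considerably between languages.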

Contributing

Add a new language support

If you would like to add support for a language of your choice to iNLTK, please start by checking for an existing issue, or raising a new one, in the project's issue tracker on GitHub (https://github.com/goru001/inltk).

To get started, please check out the steps I outlined for Telugu. They should be almost the same for other languages.

Improving models/using models for your own research

If you would like to take iNLTK's models and refine them with your own dataset or build your own custom models on top of it, please check out the repositories in the above table for the language of your choice. The repositories above contain links to datasets, pretrained models, classifiers and all of the code for that.

Add new functionality

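If you would like a particular piece of functionality in iNLTK, start by checking for an existing issue, or raising a new one, in the project's issue tracker on GitHub (https://github.com/goru001/inltk).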

What's next

..and being worked upon

Shout out if you want to help :)

..and NOT being worked upon

Shout out if you want to lead :)

iNLTK's Appreciation

Citation

If you use this library in your research, please consider citing:

@inproceedings{arora-2020-inltk,
    title = "i{NLTK}: Natural Language Toolkit for Indic Languages",
    author = "Arora, Gaurav",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.10",
    doi = "10.18653/v1/2020.nlposs-1.10",
    pages = "66--71",
    abstract = "We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic Languages. By using pre-trained models from iNLTK for text classification on publicly available datasets, we significantly outperform previously reported results. On these datasets, we also show that by using pre-trained models and data augmentation from iNLTK, we can achieve more than 95{\%} of the previous best performance by using less than 10{\%} of the training data. iNLTK is already being widely used by the community and has 40,000+ downloads, 600+ stars and 100+ forks on GitHub. The library is available at https://github.com/goru001/inltk.",
}