Natural Language Toolkit for Indic Languages (iNLTK)

iNLTK aims to provide out-of-the-box support for the various NLP tasks that an application developer might need for Indic languages. The paper describing the iNLTK library was accepted at the NLP-OSS workshop at EMNLP 2020 and is available at https://www.aclweb.org/anthology/2020.nlposs-1.10

Documentation

Check out the detailed docs, along with installation instructions, at https://inltk.readthedocs.io

Supported languages

Native languages

Language	Code <code-of-language>
Hindi	hi
Punjabi	pa
Gujarati	gu
Kannada	kn
Malayalam	ml
Oriya	or
Marathi	mr
Bengali	bn
Tamil	ta
Urdu	ur
Nepali	ne
Sanskrit	sa
English	en
Telugu	te

Code Mixed languages

Language	Script	Code <code-of-language>
Hinglish (Hindi+English)	Latin	hi-en
Tanglish (Tamil+English)	Latin	ta-en
Manglish (Malayalam+English)	Latin	ml-en
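
As a quick illustration of how the codes in the two tables above might be used, here is a minimal sketch of a lookup helper. The names `NATIVE_CODES`, `CODE_MIXED_CODES` and `language_code` are hypothetical and are not part of iNLTK; only the codes themselves come from the tables.

```python
# Language codes copied from the "Native languages" and
# "Code Mixed languages" tables above.
NATIVE_CODES = {
    "Hindi": "hi", "Punjabi": "pa", "Gujarati": "gu", "Kannada": "kn",
    "Malayalam": "ml", "Oriya": "or", "Marathi": "mr", "Bengali": "bn",
    "Tamil": "ta", "Urdu": "ur", "Nepali": "ne", "Sanskrit": "sa",
    "English": "en", "Telugu": "te",
}
CODE_MIXED_CODES = {"Hinglish": "hi-en", "Tanglish": "ta-en", "Manglish": "ml-en"}


def language_code(name: str) -> str:
    """Hypothetical helper (not part of iNLTK): map a language name to its code."""
    key = name.strip().title()
    codes = {**NATIVE_CODES, **CODE_MIXED_CODES}
    if key not in codes:
        raise ValueError(f"{name!r} is not listed as a supported language")
    return codes[key]


if __name__ == "__main__":
    code = language_code("hindi")
    print(code)
    # Assumption: per the iNLTK docs, this code is what you pass to the library,
    # e.g. `from inltk.inltk import setup; setup(code)` (requires iNLTK installed).
```

Please refer to the docs linked above for the actual API; the helper only shows that the `<code-of-language>` column is the value to pass around in your own code.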

Repositories containing models used in iNLTK

Language	Repository	Dataset used for Language modeling	Perplexity of ULMFiT LM (on validation set)	Perplexity of TransformerXL LM (on validation set)	Dataset used for Classification	Classification: Test set Accuracy	Classification: Test set MCC	Classification: Notebook for Reproducibility	ULMFiT Embeddings visualization	TransformerXL Embeddings visualization
Hindi	NLP for Hindi	Hindi Wikipedia Articles - 172k<br><br><br>Hindi Wikipedia Articles - 55k	34.06<br><br><br>35.87	26.09<br><br><br>34.78	BBC News Articles<br><br><br>IIT Patna Movie Reviews<br><br><br>IIT Patna Product Reviews	78.75<br><br><br>57.74<br><br><br>75.71	0.71<br><br><br>0.37<br><br><br>0.59	Notebook<br><br><br>Notebook<br><br><br>Notebook	Hindi Embeddings projection	Hindi Embeddings projection
Bengali	NLP for Bengali	Bengali Wikipedia Articles	41.2	39.3	Bengali News Articles (Soham Articles)	90.71	0.87	Notebook	Bengali Embeddings projection	Bengali Embeddings projection
Gujarati	NLP for Gujarati	Gujarati Wikipedia Articles	34.12	28.12	iNLTK Headlines Corpus - Gujarati	91.05	0.86	Notebook	Gujarati Embeddings projection	Gujarati Embeddings projection
Malayalam	NLP for Malayalam	Malayalam Wikipedia Articles	26.39	25.79	iNLTK Headlines Corpus - Malayalam	95.56	0.93	Notebook	Malayalam Embeddings projection	Malayalam Embeddings projection
Marathi	NLP for Marathi	Marathi Wikipedia Articles	18	17.42	iNLTK Headlines Corpus - Marathi	92.40	0.85	Notebook	Marathi Embeddings projection	Marathi Embeddings projection
Tamil	NLP for Tamil	Tamil Wikipedia Articles	19.80	17.22	iNLTK Headlines Corpus - Tamil	95.22	0.92	Notebook	Tamil Embeddings projection	Tamil Embeddings projection
Punjabi	NLP for Punjabi	Punjabi Wikipedia Articles	24.40	14.03	IndicNLP News Article Classification Dataset - Punjabi	97.12	0.96	Notebook	Punjabi Embeddings projection	Punjabi Embeddings projection
Kannada	NLP for Kannada	Kannada Wikipedia Articles	70.10	61.97	IndicNLP News Article Classification Dataset - Kannada	98.87	0.98	Notebook	Kannada Embeddings projection	Kannada Embeddings projection
Oriya	NLP for Oriya	Oriya Wikipedia Articles	26.57	26.81	IndicNLP News Article Classification Dataset - Oriya	98.83	0.98	Notebook	Oriya Embeddings Projection	Oriya Embeddings Projection
Sanskrit	NLP for Sanskrit	Sanskrit Wikipedia Articles	~6	~3	Sanskrit Shlokas Dataset	84.3 (valid set)			Sanskrit Embeddings projection	Sanskrit Embeddings projection
Nepali	NLP for Nepali	Nepali Wikipedia Articles	31.5	29.3	Nepali News Dataset	98.5 (valid set)			Nepali Embeddings projection	Nepali Embeddings projection
Urdu	NLP for Urdu	Urdu Wikipedia Articles	13.19	12.55	Urdu News Dataset	95.28 (valid set)			Urdu Embeddings projection	Urdu Embeddings projection
Telugu	NLP for Telugu	Telugu Wikipedia Articles	27.47	29.44	Telugu News Dataset<br><br><br>Telugu News Andhra Jyoti	95.4<br><br><br>92.09		Notebook <br><br><br>Notebook	Telugu Embeddings projection	Telugu Embeddings projection
Tanglish	NLP for Tanglish	Synthetic Tanglish Dataset	37.50	-	Dravidian Codemix HASOC @ FIRE 2020<br><br>Dravidian Codemix Sentiment Analysis @ FIRE 2020	F1 Score: 0.88<br><br>F1 Score: 0.62	-	Notebook<br><br>Notebook	Tanglish Embeddings Projection	-
Manglish	NLP for Manglish	Synthetic Manglish Dataset	45.84	-	Dravidian Codemix HASOC @ FIRE 2020<br><br>Dravidian Codemix Sentiment Analysis @ FIRE 2020	F1 Score: 0.74<br><br>F1 Score: 0.69	-	Notebook<br><br>Notebook	Manglish Embeddings Projection	-
Hinglish	NLP for Hinglish	Synthetic Hinglish Dataset	86.48	-	-	-	-	-	Hinglish Embeddings Projection	-

Note: The English model is taken directly from fast.ai
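
For readers unfamiliar with the two metrics reported in the table above, here is a minimal sketch of their standard definitions: perplexity as the exponential of the mean negative log-likelihood, and the multi-class Matthews correlation coefficient (MCC) computed from a confusion matrix. This is an illustration only; the notebooks in the repositories may compute these values differently (for example through library functions), and the function names below are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mcc(confusion: Sequence[Sequence[int]]) -> float:
    """Multi-class MCC from a square confusion matrix (rows = true, cols = predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(row) for row in confusion]
    pred_counts = [sum(confusion[i][k] for i in range(n)) for k in range(n)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    # By convention, return 0.0 when the denominator vanishes
    # (e.g. every sample is predicted as the same class).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy inputs, made up for illustration; they are not taken from any iNLTK model.
    print(perplexity([math.log(0.25)] * 4))
    print(mcc([[5, 1], [2, 4]]))
```

Lower perplexity is better for the language models, while MCC ranges from -1 to 1, with higher values being better for the classifiers.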

Effect of using Transfer Learning + Paraphrases from iNLTK

Language	Repository	Dataset used for Classification	Results on using complete training set	Percentage Decrease in Training set size	Results on using reduced training set without Paraphrases	Results on using reduced training set with Paraphrases
Hindi	NLP for Hindi	IIT Patna Movie Reviews	Accuracy: 57.74<br><br>MCC: 37.23	80% (2480 -> 496)	Accuracy: 47.74<br><br>MCC: 20.50	Accuracy: 56.13<br><br>MCC: 34.39
Bengali	NLP for Bengali	Bengali News Articles (Soham Articles)	Accuracy: 90.71<br><br>MCC: 87.92	99% (11284 -> 112)	Accuracy: 69.88<br><br>MCC: 61.56	Accuracy: 74.06<br><br>MCC: 65.08
Gujarati	NLP for Gujarati	iNLTK Headlines Corpus - Gujarati	Accuracy: 91.05<br><br>MCC: 86.09	90% (5269 -> 526)	Accuracy: 80.88<br><br>MCC: 70.18	Accuracy: 81.03<br><br>MCC: 70.44
Malayalam	NLP for Malayalam	iNLTK Headlines Corpus - Malayalam	Accuracy: 95.56<br><br>MCC: 93.29	90% (5036 -> 503)	Accuracy: 82.38<br><br>MCC: 73.47	Accuracy: 84.29<br><br>MCC: 76.36
Marathi	NLP for Marathi	iNLTK Headlines Corpus - Marathi	Accuracy: 92.40<br><br>MCC: 85.23	95% (9672 -> 483)	Accuracy: 84.13<br><br>MCC: 68.59	Accuracy: 84.55<br><br>MCC: 69.11
Tamil	NLP for Tamil	iNLTK Headlines Corpus - Tamil	Accuracy: 95.22<br><br>MCC: 92.70	95% (5346 -> 267)	Accuracy: 86.25<br><br>MCC: 79.42	Accuracy: 89.84<br><br>MCC: 84.63

For more details about the implementation, or to reproduce the results, check out the respective repositories.
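
The arithmetic behind the table above can be rechecked with a short script. The sketch below copies the training set sizes and accuracies from the table and computes the percentage decrease in training set size, plus the share of full-training-set accuracy that is retained with paraphrases. The helper names are hypothetical, and "percent retained" is an illustrative measure rather than one reported in the table.

```python
# (language, full training set size, reduced size, accuracy on full set,
#  accuracy on reduced set without paraphrases, accuracy with paraphrases)
# All numbers are copied from the table above.
ROWS = [
    ("Hindi", 2480, 496, 57.74, 47.74, 56.13),
    ("Bengali", 11284, 112, 90.71, 69.88, 74.06),
    ("Gujarati", 5269, 526, 91.05, 80.88, 81.03),
    ("Malayalam", 5036, 503, 95.56, 82.38, 84.29),
    ("Marathi", 9672, 483, 92.40, 84.13, 84.55),
    ("Tamil", 5346, 267, 95.22, 86.25, 89.84),
]


def percent_decrease(full: int, reduced: int) -> float:
    """Percentage by which the training set shrinks."""
    return 100.0 * (full - reduced) / full


def percent_retained(full_score: float, reduced_score: float) -> float:
    """Share of the full-training-set score kept on the reduced set."""
    return 100.0 * reduced_score / full_score


if __name__ == "__main__":
    for lang, full, reduced, acc_full, acc_without, acc_with in ROWS:
        print(
            f"{lang}: training set reduced by {percent_decrease(full, reduced):.1f}%, "
            f"accuracy retained with paraphrases {percent_retained(acc_full, acc_with):.1f}%, "
            f"gain from paraphrases {acc_with - acc_without:.2f} points"
        )
```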

Contributing

Add a new language support

If you would like to add support for a language of your choice to iNLTK, please start by checking for an existing issue, or raising a new one, on the project's GitHub issue tracker.

Please check out the steps I mentioned here for Telugu to begin with. They should be almost the same for other languages.

Improving models/using models for your own research

If you would like to take iNLTK's models and refine them with your own dataset, or build your own custom models on top of them, please check out the repository for your language in the table above. Each repository contains links to the datasets, pretrained models, classifiers and all of the code used.

Add new functionality

If you would like a particular piece of functionality in iNLTK, start by checking for an existing issue, or raising a new one, on the project's GitHub issue tracker.

What's next

..and being worked upon

Shout out if you want to help :)

Add Maithili support

..and NOT being worked upon

Shout out if you want to lead :)

Add NER support for all languages
Add Textual Entailment support for all languages
Work on a unified model for all the languages
POS support in iNLTK
Add translations - to and from languages in iNLTK + English

iNLTK's Appreciation

By Jeremy Howard on Twitter
By Sebastian Ruder on Twitter
By Vincent Boucher, By Philip Vollet, By Steve Nouri on LinkedIn
By Kanimozhi, By Soham, By Imaad on LinkedIn
iNLTK was trending on GitHub in May 2019

Citation

If you use this library in your research, please consider citing:

@inproceedings{arora-2020-inltk,
    title = "i{NLTK}: Natural Language Toolkit for Indic Languages",
    author = "Arora, Gaurav",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.10",
    doi = "10.18653/v1/2020.nlposs-1.10",
    pages = "66--71",
    abstract = "We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic Languages. By using pre-trained models from iNLTK for text classification on publicly available datasets, we significantly outperform previously reported results. On these datasets, we also show that by using pre-trained models and data augmentation from iNLTK, we can achieve more than 95{\%} of the previous best performance by using less than 10{\%} of the training data. iNLTK is already being widely used by the community and has 40,000+ downloads, 600+ stars and 100+ forks on GitHub. The library is available at https://github.com/goru001/inltk.",
}