NLP for Hindi
This repository contains state-of-the-art language models and classifiers for Hindi, a language spoken in the Indian subcontinent.
The models trained here have been used in the Natural Language Toolkit for Indic Languages (iNLTK).
Dataset
Created as part of this project
Open Source Datasets
- BBC News Articles: Sentiment analysis corpus for Hindi documents extracted from the BBC news website.
- IIT Patna Product Reviews: Sentiment analysis corpus for product reviews posted in Hindi.
- IIT Patna Movie Reviews: Sentiment analysis corpus for movie reviews posted in Hindi.
Results
Language Model Perplexity (on validation set)
Architecture/Dataset | Hindi Wikipedia Articles - 172k | Hindi Wikipedia Articles - 55k |
---|---|---|
ULMFiT | 34.06 | 35.87 |
TransformerXL | 26.09 | 34.78 |
Note: Nirant's earlier state-of-the-art Hindi language model achieved a perplexity of ~46. The scores above aren't directly comparable with his, because his training and validation sets were different and aren't available for reproduction.
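For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to a language model's validation loss. It assumes the standard definition (the exponential of the mean per-token cross-entropy, measured in nats); this README does not state how the notebooks compute it, and the function names below are illustrative only.

```python
import math


def perplexity_from_loss(cross_entropy_loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities a model assigned
    to each token of a held-out text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 tokens, so its perplexity is 4.
uniform_four = perplexity([math.log(0.25)] * 8)

# Conversely, a reported perplexity maps back to a loss via the natural log.
loss_for_ulmfit_172k = math.log(34.06)
```

Because the average is taken per token, perplexity depends on the tokenizer and vocabulary as well as on the validation set, which is one more reason to compare scores only between models evaluated under the same setup.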
Classification Metrics
ULMFiT
Dataset | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|
BBC News Articles | 78.75 | 71.61 | Link |
IIT Patna Movie Reviews | 57.74 | 37.23 | Link |
IIT Patna Product Reviews | 75.71 | 59.76 | Link |
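The two metrics in the table can be sketched in plain Python as follows. This is an illustration under assumptions, not the code from the linked notebooks: it uses the multiclass generalization of the Matthews correlation coefficient (MCC), assumes the table reports both metrics multiplied by 100, and uses made-up labels.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient, in [-1, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)                                    # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # correct predictions
    true_counts = Counter(y_true)                      # t_k per class
    pred_counts = Counter(y_pred)                      # p_k per class
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in true_counts)
    spread_true = s * s - sum(n * n for n in true_counts.values())
    spread_pred = s * s - sum(n * n for n in pred_counts.values())
    denominator = math.sqrt(spread_true * spread_pred)
    # By convention, return 0 when either side has only one class.
    return numerator / denominator if denominator else 0.0


# Hypothetical labels, for illustration only.
y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu"]
print(f"accuracy = {100 * accuracy(y_true, y_pred):.2f}")
print(f"MCC      = {100 * multiclass_mcc(y_true, y_pred):.2f}")
```

Unlike accuracy, MCC takes the class distribution of both the labels and the predictions into account, so a classifier that always predicts the majority class scores 0 even if its accuracy looks reasonable.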
Visualizations
Word Embeddings
Architecture | Visualization |
---|---|
ULMFiT | Embeddings projection |
TransformerXL | Embeddings projection |
Sentence Embeddings
Architecture | Visualization |
---|---|
ULMFiT | Encodings projection |
Results of using Transfer Learning + Data Augmentation from iNLTK
On using complete training set (with Transfer learning)
Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
IIT Patna Movie Reviews | (2480, 310, 310) | 57.74 | 37.23 | Link |
On using 20% of training set (with Transfer learning)
Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
IIT Patna Movie Reviews | (496, 310, 310) | 47.74 | 20.50 | Link |
On using 20% of training set (with Transfer learning + Data Augmentation)
Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
IIT Patna Movie Reviews | (496, 310, 310) | 56.13 | 34.39 | Link |
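To read the three tables above together, one can ask how much of the score lost by shrinking the training set to 20% is won back by data augmentation. The sketch below uses only the numbers reported in those tables; the helper name is illustrative.

```python
def fraction_recovered(full, reduced, augmented):
    """Share of the (full - reduced) drop that the augmented run wins back."""
    drop = full - reduced
    if drop == 0:
        raise ValueError("no drop between full and reduced scores")
    return (augmented - reduced) / drop


# IIT Patna Movie Reviews: full training set, 20% subset, 20% subset + augmentation
accuracy_recovered = fraction_recovered(57.74, 47.74, 56.13)
mcc_recovered = fraction_recovered(37.23, 20.50, 34.39)

print(f"accuracy gap recovered: {accuracy_recovered:.1%}")
print(f"MCC gap recovered:      {mcc_recovered:.1%}")
```

On these figures, augmentation recovers roughly 84% of the accuracy gap and roughly 83% of the MCC gap while still training on 496 examples instead of 2480.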
Pretrained Models
Language Models
Download the pretrained ULMFiT and TransformerXL language models, trained on Hindi Wikipedia Articles - 172k and Hindi Wikipedia Articles - 55k, from here
Tokenizer
The tokenizer was trained in an unsupervised manner using Google's SentencePiece.
Download the trained model and vocabulary from here