
NLP for Hindi

This repository contains state-of-the-art language models and classifiers for the Hindi language (spoken in the Indian subcontinent).

BBC News Articles: Sentiment analysis corpus for Hindi documents extracted from the BBC News website.
IIT Patna Product Reviews: Sentiment analysis corpus for product reviews posted in Hindi.
IIT Patna Movie Reviews: Sentiment analysis corpus for movie reviews posted in Hindi.

Architecture/Dataset	Hindi Wikipedia Articles - 172k	Hindi Wikipedia Articles - 55k
ULMFiT	34.06	35.87
TransformerXL	26.09	34.78

Note: Nirant did the previous SOTA work on a Hindi language model and achieved a perplexity of ~46. The perplexity scores above aren't directly comparable with his, because his training and validation sets were different and aren't available for reproduction.
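
For reference, perplexity is the exponential of the average negative log-likelihood a model assigns to each token, so lower is better. The snippet below is a minimal sketch of that definition, assuming natural-log probabilities; the `perplexity` helper and the sample values are hypothetical and are not taken from this repository's notebooks.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a sequence of tokens.

    token_log_probs: natural-log probabilities the model assigned to each
    ground-truth token (hypothetical input format, for illustration only).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities assigned to four consecutive tokens.
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
ppl = perplexity(log_probs)  # geometric-mean probability is 1/4, so ppl is about 4
```

Read this way, a perplexity of 34.06 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 34 tokens at each step.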

Dataset	Accuracy	MCC	Notebook to Reproduce results
BBC News Articles	78.75	71.61	Link
IIT Patna Movie Reviews	57.74	37.23	Link
IIT Patna Product Reviews	75.71	59.76	Link
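
The two metrics in the table can be sketched in plain Python as follows. This is an illustrative sketch, not the code from the linked notebooks: it assumes the table reports both accuracy and MCC multiplied by 100, uses the standard multiclass form of the Matthews correlation coefficient, and the label names and predictions are made up.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient, in the range [-1, 1]."""
    s = len(y_true)                                      # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correct predictions
    t_counts = Counter(y_true)                           # true count per class
    p_counts = Counter(y_pred)                           # predicted count per class
    labels = set(t_counts) | set(p_counts)
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    # By convention, return 0 when the score is undefined (e.g. one class only).
    return cov_tp / denom if denom else 0.0


# Hypothetical labels and predictions, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
acc_pct = 100 * accuracy(y_true, y_pred)
mcc_pct = 100 * mcc(y_true, y_pred)
```

MCC accounts for how predictions are distributed across classes, which is why it can sit well below accuracy on imbalanced or hard datasets, as in the movie reviews row.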

Architecture	Visualization
ULMFiT	Embeddings projection
TransformerXL	Embeddings projection

Architecture	Visualization
ULMFiT	Encodings projection

Dataset	Dataset size (train, valid, test)	Accuracy	MCC	Notebook to Reproduce results
IIT Patna Movie Reviews	(2480, 310, 310)	57.74	37.23	Link

Dataset	Dataset size (train, valid, test)	Accuracy	MCC	Notebook to Reproduce results
IIT Patna Movie Reviews	(496, 310, 310)	47.74	20.50	Link

Dataset	Dataset size (train, valid, test)	Accuracy	MCC	Notebook to Reproduce results
IIT Patna Movie Reviews	(496, 310, 310)	56.13	34.39	Link

Download the pretrained ULMFiT and TransformerXL language models, trained on Hindi Wikipedia Articles - 172k and Hindi Wikipedia Articles - 55k, from here

The tokenizer was trained in an unsupervised manner using Google's SentencePiece.

Download the trained model and vocabulary from here