NLP for Hinglish (Code-mixed Hindi+English)
This repository contains a language model for code-mixed Hinglish (Hindi and English), which is spoken in the Indian subcontinent.
The methodology followed in this repo is detailed in this paper, accepted at Dravidian-Codemix-HASOC2020@FIRE2020.
Dataset
Results
Language Model Perplexity (on validation set)
Architecture/Dataset | Synthetically Generated Wikipedia Articles Dataset |
---|---|
ULMFiT | 86.48 |
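For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to per-token log-probabilities. It assumes the usual definition (the exponential of the mean negative log-likelihood per token, in nats); the repository's exact evaluation code is not shown here, and the function name `perplexity` is illustrative. Under that assumption, a perplexity of 86.48 corresponds to an average loss of about 4.46 nats per token.

```python
import math


def perplexity(token_log_probs):
    """Return perplexity given natural-log probabilities, one per token.

    Assumes the standard definition: exp(mean negative log-likelihood).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy example: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among 4 options.
toy_log_probs = [math.log(0.25)] * 8
toy_ppl = perplexity(toy_log_probs)

# Converting a reported perplexity back to the average loss in nats.
avg_loss_for_reported = math.log(86.48)
```

Because the tokenizer works on subword units, a figure computed this way would be per-subword rather than per-word, so it is not directly comparable with word-level perplexities reported elsewhere.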
Visualizations
Word Embeddings
Architecture | Visualization |
---|---|
ULMFiT | Embeddings projection |
Pretrained Models
Language Models
Download the pretrained ULMFiT language model from here
Tokenizer
The tokenizer was trained using Google's SentencePiece.
Download the trained model and vocabulary from here
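Once downloaded, the tokenizer can likely be loaded with the `sentencepiece` Python package. The sketch below is an illustration, not the repository's own loading code: the file name `hinglish_spm.model` is a hypothetical placeholder for wherever the downloaded model is saved, and the helper names are made up for this example. It also assumes a recent `sentencepiece` release that accepts the `model_file` keyword.

```python
from pathlib import Path

# Hypothetical placeholder: replace with the path of the downloaded model file.
MODEL_PATH = Path("hinglish_spm.model")


def load_tokenizer(model_path):
    """Load a trained SentencePiece model from disk."""
    # Imported here so this file can be read without sentencepiece installed.
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=str(model_path))


def tokenize(text, model_path=MODEL_PATH):
    """Split text into subword pieces using the trained model."""
    sp = load_tokenizer(model_path)
    return sp.encode(text, out_type=str)


if __name__ == "__main__":
    if MODEL_PATH.exists():
        print(tokenize("yeh movie bahut achhi thi, I loved it"))
    else:
        print(f"Model file not found at {MODEL_PATH}; download it first.")
```

Passing `out_type=int` instead returns vocabulary ids rather than string pieces, which is the form a language model typically consumes.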