NLP for Hinglish (Code mixed Hindi+English)

This repository contains a language model for code-mixed Hinglish (Hindi and English), as spoken in the Indian subcontinent.

The methodology followed in this repo is detailed in this paper, accepted at Dravidian-Codemix-HASOC2020@FIRE2020.

Dataset

  1. Synthetically Generated Hinglish Dataset from Wikipedia Articles

Results

Language Model Perplexity (on validation set)

Architecture/Dataset | Synthetically Generated Wikipedia Articles Dataset
ULMFiT | 86.48
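
To help interpret the perplexity figure, here is a minimal sketch of how perplexity relates to validation loss. It assumes perplexity is computed as the exponential of the mean per-token cross-entropy in nats, which is the common convention but is not stated in this README. The function names are illustrative and not part of this repo.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (assumed to be in nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse of the above: the mean cross-entropy implied by a perplexity value."""
    return math.log(perplexity)


# Perplexity reported above for ULMFiT on the validation set.
reported_perplexity = 86.48

# Under the stated assumption, this is the validation loss that corresponds to it.
implied_loss = loss_from_perplexity(reported_perplexity)
print(f"implied validation loss: {implied_loss:.2f} nats/token")
```

Under that assumption, a lower validation loss translates directly into a lower perplexity, so the two numbers can be compared interchangeably across training runs.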

Visualizations

Word Embeddings
Architecture | Visualization
ULMFiT | Embeddings projection

Pretrained Models

Language Models

Download the pretrained ULMFiT language model from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
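
Once the tokenizer model has been downloaded, it could be loaded roughly as sketched below with the `sentencepiece` Python package. The file name `hinglish_tokenizer.model` and the sample sentence are hypothetical placeholders, since this README does not name the downloaded files; substitute the actual path.

```python
def load_processor(model_path):
    """Load a trained SentencePiece model from disk.

    The import is done lazily so this module can be imported even when
    the sentencepiece package is not installed.
    """
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)


def tokenize(processor, text):
    """Split text into subword pieces using the given processor."""
    return processor.encode(text, out_type=str)


# Usage (hypothetical file name; replace with the downloaded model path):
#   sp = load_processor("hinglish_tokenizer.model")
#   print(tokenize(sp, "yeh movie bahut acchi thi"))
```

The pieces produced depend entirely on the trained vocabulary, so the exact output for any sentence can only be known by running it against the downloaded model.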