NLP for Tanglish (Code-mixed Tamil + English)

This repository contains state-of-the-art language models and a classifier for code-mixed Tanglish (Tamil and English), which is spoken in the Indian subcontinent.

Dataset

  1. Tamil Wikipedia Articles: preprocessed and transliterated versions of this dataset, which are used for language modeling in this repo, can be downloaded directly from here

  2. Dravidian Codemix HASOC @ FIRE 2020

  3. Dravidian Codemix Sentiment Analysis @ FIRE 2020

Results

Language Model Perplexity (on validation set)

| Architecture/Dataset | Tamil Wikipedia Articles | Vocab size |
| --- | --- | --- |
| ULMFiT | 37.50 | 8000 |
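
The repository does not state how the perplexity is computed. A minimal sketch, assuming the usual definition (the exponential of the mean per-token cross-entropy, measured in nats), looks like this. The function name `perplexity` and the loss values are hypothetical and only illustrate the arithmetic.

```python
import math


def perplexity(token_losses):
    """Return exp(mean loss) for a list of per-token cross-entropy values in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token losses, not taken from this repository.
example_losses = [3.2, 3.9, 3.6, 3.8]
print(f"perplexity = {perplexity(example_losses):.2f}")

# The inverse direction: the mean validation loss implied by a perplexity of 37.50.
print(f"implied loss = {math.log(37.50):.3f} nats per token")
```

Under this assumption, perplexity depends on the tokenization, so values are only comparable between models that share the same vocabulary.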

Classification Metrics

ULMFiT
| Dataset | F1 | Precision | Recall | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| Dravidian Codemix HASOC @ FIRE 2020 | 0.88 | 0.88 | 0.88 | Link |
| Dravidian Codemix Sentiment Analysis @ FIRE 2020 | 0.62 | 0.65 | 0.69 | Link |
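
The table does not say how the scores are averaged across classes. As an illustration only, the sketch below computes support-weighted precision, recall, and F1, which is one common choice for imbalanced multi-class data. The function name `weighted_prf` and the labels are hypothetical, and the linked notebooks may use a different averaging scheme.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the labels present in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        # True positives and total predictions for this label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / n
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        # Each label contributes in proportion to its share of the true labels.
        weight = n / total
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1


# Hypothetical labels for illustration; not drawn from the FIRE 2020 data.
y_true = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "OFF", "NOT", "NOT"]
p, r, f = weighted_prf(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With this kind of averaging, the F1 score is an average of per-class F1 values, so it does not have to lie between the averaged precision and recall.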

Visualizations

Word Embeddings
| Architecture | Vocab Size | Visualization |
| --- | --- | --- |
| ULMFiT | 8k | Embeddings projection |

Pretrained Models

Language Models

Download pretrained ULMFiT LM with 8k vocab from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
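
Once the model file is downloaded, it can be loaded with the `sentencepiece` Python package. The sketch below is an assumption about typical usage rather than this repository's own code: the file name `tanglish_spm.model`, the helper `tokenize`, and the sample sentence are all hypothetical.

```python
from pathlib import Path


def tokenize(text, model_path):
    """Split text into subword pieces, or return None if the model file is absent."""
    if not Path(model_path).is_file():
        return None
    # Third-party dependency, imported lazily: pip install sentencepiece
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=str(model_path))
    return sp.encode(text, out_type=str)


# "tanglish_spm.model" is a placeholder; point this at the downloaded model file.
pieces = tokenize("Intha padam romba nalla irukku", "tanglish_spm.model")
print(pieces if pieces is not None else "model file not found")
```

Passing `out_type=int` instead returns vocabulary ids, which is the form a language model would normally consume.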