NLP for Tanglish (Code-mixed Tamil + English)
This repository contains state-of-the-art language models and a classifier for code-mixed Tanglish (Tamil and English), which is spoken in the Indian subcontinent.
Dataset
- Tamil Wikipedia Articles: Preprocessed and transliterated versions of this dataset, which are used for language modeling in this repo, can be downloaded directly from here
Results
Language Model Perplexity (on validation set)
Architecture/Dataset | Tamil Wikipedia Articles | Vocab size |
---|---|---|
ULMFiT | 37.50 | 8000 |
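
For readers new to the metric, perplexity is commonly defined as the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship in plain Python. The function name and the sample probabilities are illustrative assumptions and are not taken from this repository's notebooks.

```python
import math

def perplexity(token_probs):
    """Return exp(mean negative log-likelihood) over the given token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to the correct next token.
# A model that always spreads its belief evenly over 4 tokens has perplexity close to 4.
sample = [0.25, 0.25, 0.25, 0.25]
print(perplexity(sample))
```

If the reported figure was computed this way, a perplexity of 37.50 corresponds to a mean validation loss of about ln(37.50) ≈ 3.62 nats per token.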
Classification Metrics
ULMFiT
Dataset | F1 | Precision | Recall | Notebook to Reproduce results |
---|---|---|---|---|
Dravidian Codemix HASOC @ FIRE 2020 | 0.88 | 0.88 | 0.88 | Link |
Dravidian Codemix Sentiment Analysis @ FIRE 2020 | 0.62 | 0.65 | 0.69 | Link |
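
The table does not say how the per-class scores were averaged. One common choice is support-weighted averaging (as in scikit-learn's `average="weighted"`). Under that scheme, the averaged F1 need not lie between the averaged precision and recall, which may explain a row where F1 is below both. The following is a minimal sketch under that assumption. The function name and the toy labels are made up for illustration.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the labels in y_true."""
    support = Counter(y_true)        # number of true examples per label
    predicted = Counter(y_pred)      # number of predictions per label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total           # each class counts in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Toy labels (not drawn from the FIRE 2020 data).
y_true = ["pos"] * 10 + ["neg"] * 10
y_pred = ["pos"] * 2 + ["neg"] * 18
p, r, f = weighted_prf(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# prints: precision=0.78 recall=0.60 f1=0.52
```

In this toy case the weighted F1 (0.52) falls below both the weighted precision (0.78) and the weighted recall (0.60), because each class is strong on one measure and weak on the other.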
Visualizations
Word Embeddings
Architecture | Vocab Size | Visualization |
---|---|---|
ULMFiT | 8k | Embeddings projection |
Pretrained Models
Language Models
Download pretrained ULMFiT LM with 8k vocab from here
Tokenizer
The tokenizer was trained using Google's SentencePiece.
Download the trained model and vocabulary from here
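
Once downloaded, a SentencePiece model can typically be loaded with the `sentencepiece` Python package. The sketch below assumes the package is installed. The helper names and the file name `tokenizer.model` are placeholders, not the actual names used in this repository.

```python
def load_tokenizer(model_path):
    """Load a trained SentencePiece model from disk."""
    # Imported lazily so this module can be read without the third-party package.
    import sentencepiece as spm  # pip install sentencepiece
    return spm.SentencePieceProcessor(model_file=model_path)

def tokenize(processor, text):
    """Split text into subword pieces (strings rather than integer ids)."""
    return processor.encode(text, out_type=str)

# Example usage (placeholder path; point it at the downloaded model file):
#   sp = load_tokenizer("tokenizer.model")
#   pieces = tokenize(sp, "padam super ah irukku")
```

Passing `out_type=int` instead returns vocabulary ids, which is the form a language model usually consumes.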