NLP for Tamil

This repository contains state-of-the-art language models and a classifier for the Tamil language, which is spoken in India, Sri Lanka, Malaysia and Singapore.

The models trained here have been used in the Natural Language Toolkit for Indic Languages (iNLTK).

Dataset

Created as part of this project

  1. Tamil Wikipedia Articles

  2. Tamil News Dataset

Open Source Datasets

  1. iNLTK Headlines Corpus - Tamil: uses the Tamil News Dataset prepared above.

Results

Language Model Perplexity (on validation set)

| Architecture | Tamil Wikipedia Articles (perplexity) | Vocab size |
| --- | --- | --- |
| ULMFiT | 19.80 | 8k |
| TransformerXL | 18.91 | 8k |
| TransformerXL | 17.22 | 16k |
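
For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to a language model's per-token log-probabilities. It assumes the usual definition (the exponential of the mean negative log-likelihood, using natural logarithms); the function name `perplexity` and the toy inputs are illustrative and are not taken from this repository's notebooks.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_log_probs: natural-log probabilities the model assigned to each
    token of the validation text (hypothetical input for illustration).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every token probability 1/8
# is exactly as uncertain as a uniform choice among 8 options.
uniform = [math.log(1 / 8)] * 5
print(perplexity(uniform))
```

Under this definition, a perplexity of 19.80 corresponds to a mean validation loss of roughly 2.99 nats per token, and lower is better. Perplexities measured with different vocabulary sizes (8k vs. 16k) are per-token figures over different tokenizations, so they are not strictly comparable.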

Classification Metrics

ULMFiT

| Dataset | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- |
| iNLTK Headlines Corpus - Tamil | 95.22 | 92.70 | Link |
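
MCC here is the Matthews correlation coefficient. The sketch below shows one way to compute its multiclass generalization from true and predicted labels using only the standard library. The function name `multiclass_mcc` and the placeholder labels are illustrative assumptions, not the code used in the linked notebook. The table appears to report MCC multiplied by 100, which is also an assumption on my part.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient in [-1, 1]."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                      # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correct predictions
    t_counts = Counter(y_true)                           # true count per class
    p_counts = Counter(y_pred)                           # predicted count per class
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_true = s * s - sum(v * v for v in t_counts.values())
    cov_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = math.sqrt(cov_true * cov_pred)
    # By convention, return 0 when either side has only one class.
    return numerator / denominator if denominator else 0.0


# Placeholder labels, not the real headline categories.
truth = ["a", "b", "c", "a"]
print(multiclass_mcc(truth, truth))             # perfect agreement
print(multiclass_mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

Unlike accuracy, MCC accounts for how predictions are distributed across classes, so it is less flattering when a classifier favors the majority class.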

Visualizations

Word Embeddings

| Architecture | Vocab Size | Visualization |
| --- | --- | --- |
| ULMFiT | 8k | Embeddings projection |
| TransformerXL | 8k | Embeddings projection |
| TransformerXL | 16k | Embeddings projection |

Results of using Transfer Learning + Data Augmentation from iNLTK

On using complete training set (with Transfer learning)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| iNLTK Headlines Corpus - Tamil | (5346, 669, 669) | 95.22 | 92.70 | Link |

On using 5% of training set (with Transfer learning)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| iNLTK Headlines Corpus - Tamil | (267, 669, 669) | 86.25 | 79.42 | Link |

On using 5% of training set (with Transfer learning + Data Augmentation)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| iNLTK Headlines Corpus - Tamil | (267, 669, 669) | 89.84 | 84.63 | Link |
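
The 267-example training set in the low-resource rows is 5% of the 5346 training examples, rounded down. A minimal sketch of drawing such a subset follows. The helper `subsample`, the fixed seed, and the use of uniform random sampling are assumptions for illustration; the README does not say how the authors selected their subset.

```python
import random


def subsample(examples, fraction, seed=0):
    """Return a random subset containing floor(len(examples) * fraction) items."""
    k = int(len(examples) * fraction)
    rng = random.Random(seed)      # fixed seed so the subset is reproducible
    return rng.sample(examples, k)


# Stand-in for the 5346 training headlines (indices instead of text).
train = list(range(5346))
small_train = subsample(train, 0.05)
print(len(small_train))  # 267
```

The validation and test sets stay at 669 examples each in all three settings, so the accuracy and MCC figures are measured on the same held-out data.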

Pretrained Models

Language Models

Download pretrained ULMFiT LM with 8k vocab from here

Download pretrained TransformerXL LM with 8k vocab from here

Download pretrained TransformerXL LM with 16k vocab from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here