NLP for Telugu

This repository contains state-of-the-art language models and a classifier for Telugu, a language spoken in the Indian subcontinent.

The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).

Datasets

The following datasets were created as part of this project:

  1. Telugu Wikipedia Dataset

  2. Telugu News Dataset

  3. Telugu News Dataset II

Results

Language Model Perplexity

| Architecture/Dataset | Telugu Wikipedia Articles |
|----------------------|---------------------------|
| ULMFiT               | 27.47                     |
| TransformerXL        | 29.44                     |
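For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is commonly computed: the exponential of the average negative log-likelihood per token. It assumes natural-log probabilities; the function name `perplexity` and the toy inputs are illustrative and are not taken from this repository's training code. Under that convention, a perplexity of 27.47 would correspond to an average loss of about 3.31 nats per token.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for a list of natural-log
    probabilities, one per predicted token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy check: a model that assigns probability 1/4 to every token is
# "as confused as" a uniform choice among 4 options.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))  # approximately 4.0
```

Lower is better: a lower perplexity means the model assigns higher probability to the held-out text.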

Classification Metrics

ULMFiT
| Dataset                             | Accuracy | Kappa Score |
|-------------------------------------|----------|-------------|
| Telugu News Articles                | 95.4     | 93.8        |
| Telugu News Articles - Andhra Jyoti | 92.09    |             |
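As a rough illustration of the two metrics reported above, the sketch below computes accuracy and Cohen's kappa from lists of true and predicted labels. The helper names and the toy labels are hypothetical and do not come from this repository; the scores above appear to be expressed as percentages, whereas this sketch returns fractions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both label the same class at random,
    # given each side's marginal class frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: only one class on both sides
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels, purely illustrative.
y_true = ["sports", "sports", "business", "business"]
y_pred = ["sports", "sports", "business", "sports"]
print(accuracy(y_true, y_pred))     # 0.75
print(cohen_kappa(y_true, y_pred))  # 0.5
```

Kappa is useful alongside accuracy because it discounts the agreement a classifier would achieve by guessing according to class frequencies alone.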

Visualizations

Embedding Space
| Architecture  | Visualization         |
|---------------|-----------------------|
| ULMFiT        | Embeddings projection |
| TransformerXL | Embeddings projection |

Pretrained Language Model

Download the pretrained language model from here

Classifier

Download the classifier from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
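Once the model file is downloaded, it can be loaded with the `sentencepiece` Python package. The following is a minimal sketch, not this repository's own code: the function name `tokenize_telugu` and the file name `telugu_tokenizer.model` are placeholders, so substitute the path of the file you actually downloaded.

```python
import os


def tokenize_telugu(text, model_path):
    """Split text into subword pieces using a trained SentencePiece model."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"SentencePiece model not found: {model_path}")
    # Third-party dependency: pip install sentencepiece
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor()
    processor.Load(model_path)
    return processor.EncodeAsPieces(text)


if __name__ == "__main__":
    # Placeholder path (an assumption); point this at the downloaded model.
    MODEL_PATH = "telugu_tokenizer.model"
    if os.path.exists(MODEL_PATH):
        print(tokenize_telugu("తెలుగు ఒక ద్రావిడ భాష", MODEL_PATH))
```

The accompanying vocabulary file is a human-readable listing of the pieces; only the model file is needed to tokenize text.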