NLP for Kannada

This repository contains state-of-the-art language models and a classifier for Kannada, a language spoken predominantly by Kannada people in India, mainly in the state of Karnataka.

The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).

Dataset

Created as part of this project

  1. Kannada Wikipedia Articles

  2. Kannada News Dataset

Open Source Datasets

  1. IndicNLP News Article Classification Dataset - Kannada

Results

Language Model Perplexity (on validation set)

| Architecture/Dataset | Kannada Wikipedia Articles |
| --- | --- |
| ULMFiT | 70.10 |
| TransformerXL | 61.97 |
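
For readers unfamiliar with the metric, perplexity is commonly computed as the exponential of the average negative log-likelihood per token. The sketch below assumes that definition (this README does not state how its figures were computed). The function name `perplexity` and the toy inputs are illustrative and not part of this repository.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood).

    token_log_probs: natural-log probabilities that a language model
    assigned to each correct next token in a held-out text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy input: a model that always assigns probability 0.25 to the correct
# token behaves like a uniform guess over 4 choices.
toy_log_probs = [math.log(0.25)] * 8
print(perplexity(toy_log_probs))

# Going the other way: the average loss (in nats per token) implied by a
# given perplexity, under the same assumed definition.
print(math.log(70.10))
```

Under the same assumption, the reported perplexity of 70.10 would correspond to an average validation loss of about 4.25 nats per token. Lower perplexity means the model is less surprised by held-out text.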

Classification Metrics

ULMFiT
| Dataset | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- |
| IndicNLP News Article Classification Dataset - Kannada | 98.87 | 98.30 | Link |
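
As a reminder of what the two columns measure, here is a minimal sketch of accuracy and the multi-class Matthews correlation coefficient (MCC) in plain Python. It assumes the standard multi-class generalisation of MCC and that the table reports both values multiplied by 100. The function names and the toy labels are illustrative and are not taken from the linked notebook.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient.

    Uses MCC = (c*s - sum_k p_k*t_k)
               / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the sample count, c the number of correct predictions,
    t_k how often class k truly occurs and p_k how often it is predicted.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    sum_pt = sum(p_counts[k] * t_counts[k] for k in labels)
    sum_p2 = sum(v * v for v in p_counts.values())
    sum_t2 = sum(v * v for v in t_counts.values())
    denom = math.sqrt((s * s - sum_p2) * (s * s - sum_t2))
    # By convention, return 0 when the denominator vanishes
    # (for example, when only one class is ever predicted).
    return 0.0 if denom == 0 else (c * s - sum_pt) / denom


# Toy labels for three hypothetical classes, with one mistake.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 1]
print(accuracy(y_true, y_pred))
print(mcc(y_true, y_pred))
```

Unlike accuracy, MCC takes the full confusion matrix into account, which is why it is often reported alongside accuracy when classes may be imbalanced.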

Visualizations

Word Embeddings
| Architecture | Visualization |
| --- | --- |
| ULMFiT | Embeddings projection |
| TransformerXL | Embeddings projection |

Pretrained Models

Language Models

Download the pretrained language model from here

Tokenizer

The tokenizer was trained using Google's SentencePiece

Download the trained model and vocabulary from here
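
Once the tokenizer model has been downloaded, it can be loaded with the `sentencepiece` Python package. The following is a minimal sketch, not the project's own loading code: the file name `kannada_tokenizer.model` is a hypothetical placeholder for wherever you saved the download, and `load_tokenizer` is an illustrative helper name.

```python
import os

# Hypothetical local path; replace with the location of the downloaded file.
MODEL_PATH = "kannada_tokenizer.model"


def load_tokenizer(model_path):
    """Load a trained SentencePiece model from disk."""
    # Imported lazily so this script can still run (and report the missing
    # file) on a machine where sentencepiece is not installed.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    return sp


if os.path.exists(MODEL_PATH):
    sp = load_tokenizer(MODEL_PATH)
    text = "ಕನ್ನಡ ಒಂದು ದ್ರಾವಿಡ ಭಾಷೆ"
    pieces = sp.EncodeAsPieces(text)  # subword strings
    ids = sp.EncodeAsIds(text)        # vocabulary ids for the same pieces
    print(pieces)
    print(ids)
    print(sp.DecodeIds(ids))          # reconstruct text from the ids
else:
    print("Tokenizer model not found at", MODEL_PATH)
```

The exact pieces and ids depend on the vocabulary in the downloaded model, so no expected output is shown here.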