NLP for Urdu

This repository contains state-of-the-art language models and a classifier for Urdu, which is spoken mainly in Pakistan and India, and also in Nepal, Bangladesh, and several other countries.

The models trained here have been used in the Natural Language Toolkit for Indic Languages (iNLTK).

Dataset

Created as part of this project

  1. Urdu Wikipedia Articles

  2. Urdu News Dataset

Results

Language Model Perplexity

Architecture/Dataset    Urdu Wikipedia Articles
ULMFiT                  13.19
TransformerXL           12.55
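
For readers unfamiliar with the metric, the sketch below shows how perplexity is conventionally derived from per-token log-probabilities, or equivalently from a mean cross-entropy loss. It assumes natural-log probabilities and is not taken from this repository's training code; the function names `perplexity` and `perplexity_from_loss` are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability a model assigned to each token.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token (cross-entropy in nats).
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among four tokens.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

Under this definition, lower is better: a perplexity near 13 means the model is, on average, about as uncertain as a uniform choice among roughly 13 tokens.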

Classification Metrics

ULMFiT
Dataset              Accuracy    Kappa Score
Urdu News Dataset    95.28       91.58
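
As a rough illustration of the two metrics above, the sketch below computes accuracy and Cohen's kappa from true and predicted labels. It assumes the reported kappa score is Cohen's kappa scaled to a percentage, which the table does not state; the helper names and the example labels are hypothetical and are not taken from the Urdu News Dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both label sources pick the same class
    # if they were independent with these marginal frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # only one class present and always predicted
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels for illustration only.
y_true = ["sports", "sports", "business", "business"]
y_pred = ["sports", "sports", "business", "sports"]
print(accuracy(y_true, y_pred))     # 0.75
print(cohen_kappa(y_true, y_pred))  # 0.5
```

Kappa is lower than accuracy here because half of the observed agreement would be expected by chance given the label frequencies.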

Visualizations

Embedding Space
Architecture     Visualization
ULMFiT           Embeddings projection
TransformerXL    Embeddings projection

Pretrained Language Model

Download pretrained ULMFiT LM from here

Download pretrained TransformerXL LM from here

Classifier

Download classifier from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here

Credits

NLP for Marathi