Awesome Urdu
A curated list of resources dedicated to the Urdu language.
Maintainers - Ikram Ali
Please read the contribution guidelines before contributing.
Please feel free to create pull requests.
Urdu Datasets
General NLP Datasets
- Web news Data - Urdu Web news Data
- Roman Urdu Dataset - Data for sentiment analysis, along with misc compiled data for Roman Urdu
- Collection of Urdu Datasets - Datasets for POS, NER and NLP tasks
- Urdu Universal Dependency Treebank
- UrduSummary Corpus Benchmark, 2016
- Rekhta Ghazals
- Urdu Paraphrase Plagiarism Corpus, 2016
- Derived from: COrpus of Urdu News TExt Reuse (CoUNTeR), 2016
- Extension: Urdu Short Text Reuse Corpus (USTRC), 2018
- TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages
- Flickr8k Urdu Image-Caption Generation Dataset, 2020
- mLAMA: multilingual LAnguage Model Analysis, 2021
- Urdu Word Segmentation using CRF, 2018
- Apertium linguistic data for Urdu
Urdu Text Classification
- Fake News Classification, 2020 (Old Version)
- iNLTK Urdu News Headlines Classification Benchmark, 2020
- Express News Headlines+Summary, 2019
Urdu Named-Entity Recognition
Urdu Monolingual Corpora
- UFAL Corpus, 2014 - 5.4M sentences (with POS tags)
- CommonCrawl
- OSCAR Corpus, 2020
- CC-100 Corpus, 2019 - CC crawls from Jan-Dec 2018
- WMT Raw 2017 - CC crawls from 2012-2016
- https://dumps.wikimedia.org/urwiki/
- Processed Dumps: iNLTK Wiki Articles, 2020, Tatoeba Challenge, 2020, 2016 UrduWikiCorpus
- To process the latest dump yourself, use a library like WiToKit
- Leipzig Corpora
- Maḵẖzan
- Commercial-licensed corpora
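For the Wikipedia dumps listed above, the following is a minimal sketch of streaming page titles and raw wikitext out of a bz2-compressed MediaWiki XML export using only the Python standard library. It is not how WiToKit works internally; the function names (`local_name`, `iter_pages`) and the tiny in-memory sample are illustrative assumptions, and the output is raw wikitext that still needs markup stripping before it can serve as a clean corpus.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = "", ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # release the memory held by the finished page


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>اردو</title>
    <revision><text>اردو ایک زبان ہے۔</text></revision>
  </page>
  <page>
    <title>لاہور</title>
    <revision><text>لاہور ایک شہر ہے۔</text></revision>
  </page>
</mediawiki>"""

compressed = bz2.compress(SAMPLE.encode("utf-8"))

# bz2.open also accepts a path, so a downloaded dump can be passed directly.
with bz2.open(io.BytesIO(compressed), "rb") as dump:
    pages = list(iter_pages(dump))

for page_title, wikitext in pages:
    print(page_title, len(wikitext))
```

Parsing incrementally with `iterparse` keeps memory use flat, which matters because full dumps are far larger than this sample.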
Urdu Sentiment Datasets
- Urdu IMDb Movie Reviews - IMDB Movie Reviews data in Urdu
- Urdu Sentiment Benchmark, 2020
- 2010 Disaster Response Messages
- Lexicon
- Roman Urdu
- Hate Speech & Offensive Language Detection, 2020 - 10k tweets
- UCI Roman-Urdu Sentiment Classification, 2018 - 20k records
- Did You Offend Me? Classification of Offensive Tweets, 2018 - 3k tweets
Urdu OCR Datasets
- Qaida - Synthetic datasets and pre-trained models
- U-HAT - Urdu Hand-Written Text Dataset
- 45K+ Clean-Background-Urdu-Ligatures-Dataset, 2019
- IIIT-Hyderabad: Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 2017
- CLE Pakistan Urdu Image Corpora (Corresponding texts)
- Cursive-Text: A Benchmark for Urdu Text Recognition in Natural Scene Images, 2020 - 2500 images, email for dataset
Urdu Parallel Corpora for Machine Translation
- OPUS Corpora (Select en->ur)
- Contains: CC-Aligned, Tanzil, JW300, OpenSubtitles, TED, QED, etc.
- IIIT-Hyderabad MT Bhasha
- Contains Mann ki Baat and Press Information Bureau datasets
- PM India Parallel Corpus
- English-Urdu Religious Parallel Corpus
- Anuvaad Parallel Corpora
- MechanicalTurks 2012 Parallel Corpora
- Urdu-Nepali-English Parallel Corpus (Test set here)
- Cross-Language English-Urdu (CLEU) Corpus, 2018
- Flickr 8k Benchmark - 2.7k sentences
- Universal Declaration of Human Rights (benchmark)
- Commercial-licensed corpora
- EMILLE/CIIL Corpus - Contains monolingual data as well
- National Platform for Language Technology
- Technology Development for Indian Languages (Search "Urdu Corpus")
Urdu Transliteration Datasets
- Google Dakshina, 2020
- TRANSLIT: A Large-scale Name Transliteration Resource, 2020
- Roman to Urdu Transliteration Sentences, 2020 (Drive Link available on request)
- Roman-Urdu Conversion Data
- Trilingual Ur-RomUr-Eng Dict, 2019
Urdu Lexical Resources
- Offline Eng-Urd Dictionary DB
- UrduHack Words-List - Includes N-grams, NER Labels
- CLE Urdu WordNet (Demo, PDF)
- CLE Urdu Verb List, Words List, Most Frequent Words
- IndoWordnet Parallel Corpus (API - pyiwn, Demo)
- MTurks-10k Multilingual Dictionary, 2014
- Microsoft IT Terminology
- Urdu N-grams, 2020 - Uni-Gram, Bi-Gram, Tri-Gram and Tetra-Gram
- CLE Urdu Books N-Grams
- Roman Urdu Lexical Normalization, 2019
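To make the n-gram resources above concrete, here is a small sketch of counting uni-grams and bi-grams over a few sentences. It assumes whitespace tokenization, which is a simplification for Urdu (proper word segmentation is a separate task), and the helper names and sample sentences are illustrative only, not taken from any listed dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams across sentences, tokenizing on whitespace."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts


# Hypothetical sample sentences ("this is a book", "this is a pen").
sentences = ["یہ ایک کتاب ہے", "یہ ایک قلم ہے"]

unigram_counts = count_ngrams(sentences, 1)
bigram_counts = count_ngrams(sentences, 2)

for gram, freq in bigram_counts.most_common(3):
    print(" ".join(gram), freq)
```

N-grams are counted per sentence, so no n-gram spans a sentence boundary.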
Urdu Speech Datasets
- Urdu 250 Isolated Words, 2018
- CLE Phonetically Rich Urdu Speech Corpus
- CMU Wilderness Speech Dataset, 2019
- FCBH Recordings
- LibriVox AudioBooks
- Commercial-licensed corpora
- CLE Pakistan Urdu Speech Corpus (Main website)
- LDC UPenn Datasets - Filter search by selecting language
- Urdu Raw Speech Corpus, LDCIL
- LDCIL ASR Corpus
- Emotion
Cross-lingual Datasets
- Cross-lingual Natural Language Inference (XNLI) Corpus, 2020
- Google XTREME Benchmark, 2020 - Evaluation of cross-lingual generalization of multilingual models
- Urdu-Punjabi Pairs, Apertium
Urdu NLP Tools, Libraries and Models
- UrduHack
- PronouncUR - Converts Urdu words to their pronunciations
- iNLTK
- Indic PoS/NER Tagger
- Urdu Morphological Analyzer, IIIT Hyderabad
- EasyOCR
Language Models
- HuggingFace Models
- Google Multilingual-T5, 2020
- Google MuRIL, 2020
- iNLTK Models, 2019
- XLM-RoBERTa, 2019
- Multilingual BERT, 2019
Word Embeddings
- UrduHack Word-Vectors, 2019 - Word2Vec and FastText models
- Facebook FastText models: Wiki-2016, CC+Wiki-2017, Multilingual Aligned, 2017
- BPEmb: Subword Embeddings, 2017 (Multilingual Aligned)
- ConceptNet Embeddings, 2017
- Polyglot Embeddings, 2013
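Several of the embeddings above, including the fastText ones, are distributed as plain-text `.vec` files: a header line with the vocabulary size and dimension, then one word and its values per line. The sketch below reads that format and compares two words by cosine similarity using only the standard library. The function names and the three 2-dimensional vectors are made up for illustration, and other releases in this list may use binary or different formats.

```python
import io
import math


def load_vec(stream, limit=None):
    """Read word vectors from a text stream in fastText .vec format."""
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy vectors standing in for a real downloaded file.
SAMPLE = "3 2\nکتاب 1.0 0.0\nقلم 0.8 0.6\nشہر 0.0 1.0\n"

vectors = load_vec(io.StringIO(SAMPLE))
words = list(vectors)
similar = cosine(vectors[words[0]], vectors[words[1]])
unrelated = cosine(vectors[words[0]], vectors[words[2]])
print(round(similar, 3), round(unrelated, 3))
```

For a real file, open it with `encoding="utf-8"` and pass a `limit`, since full vocabularies can hold a very large number of entries.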
Translation Models
- IL-Multi, 2020
- Facebook M2M-100, 2020
- Python Translators Services - Library to use Google, Bing, etc. translators for free
Transliteration Libraries
- PolyGlot
- LibIndicTrans - Transliterate Roman/Hindi to Urdu and vice-versa
- Aksharamukha - Devanagari (Hindi) to Urdu script converter
- Google Transliterate API - Roman Urdu to Perso-Arabic
Online Resources/Services
Urdu News websites
Dictionaries
- ur.oxforddictionaries.com - Oxford Dictionary
- English Urdu Dictionary
- Urdu English Dictionary 2