Awesome
Summary Dataset
This is a summarization dataset that you can use to train an abstractive summarization model. It contains 3 files, i.e. train, test, and val. Data is in jsonl format.
Every line has these keys:
id
url
title
summary
text
You can easily read the data with pandas
import pandas as pd
test = pd.read_json("summary/urdu_test.jsonl", lines=True)
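If pandas is not installed, the same file can be read with the standard library alone. The following is a minimal sketch, assuming each line is a JSON object with the five keys listed above and that the file is UTF-8 encoded; `read_jsonl`, `missing_keys`, and `EXPECTED_KEYS` are hypothetical names introduced here for illustration.

```python
import json

# Keys listed above for every record in the summary dataset.
EXPECTED_KEYS = {"id", "url", "title", "summary", "text"}


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def missing_keys(record):
    """Return the expected keys that are absent from a record."""
    return EXPECTED_KEYS - set(record)
```

For example, `records = list(read_jsonl("summary/urdu_test.jsonl"))` loads the test split, and `missing_keys(records[0])` should be an empty set if the record is complete.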
POS dataset
An Urdu dataset for POS training. It is small and can be used to train a part-of-speech tagger for the Urdu language. The structure of the dataset is simple, i.e.
word TAG
word TAG
The tagset used to build the dataset is taken from Sajjad's Tagset.
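The `word TAG` layout can be parsed with a few lines of Python. This is only a sketch under stated assumptions: the word and tag are separated by whitespace, and a blank line marks a sentence boundary (the source does not say how sentences are delimited). `parse_pos_lines` is a hypothetical helper name.

```python
def parse_pos_lines(lines):
    """Group `word TAG` lines into sentences of (word, tag) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip lines that do not have both a word and a tag
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences
```

Since a file object iterates over its lines, `parse_pos_lines(open(path, encoding="utf-8"))` would work on a file, provided it is UTF-8 encoded.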
NER Datasets
The following datasets are used for NER tasks.
UNER Dataset
Happy to announce that the UNER (Urdu Named Entity Recognition) dataset is available for NLP apps. The following NER tags are used to build the dataset:
PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME
If you want to read more about the dataset, check this paper: Urdu NER.
The NER dataset is in utf-16 format.
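Because the file is UTF-16 rather than UTF-8, the encoding has to be passed explicitly when opening it. A minimal sketch follows; `read_utf16_lines` is a hypothetical helper, and the code makes no assumption about the column layout inside each line.

```python
def read_utf16_lines(path):
    """Return the non-empty lines of a UTF-16 encoded text file."""
    # The "utf-16" codec reads the byte order mark to choose the byte order.
    with open(path, encoding="utf-16") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

With pandas, the equivalent is to pass `encoding="utf-16"` to `pd.read_csv`.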
MK-PUCIT Dataset
The latest dataset for Urdu NER is available. Check this paper for more information: MK-PUCIT.
Entities used in the dataset are:
Other
Organization
Person
Location
The MK-PUCIT author also provided a Dropbox link to download the data: Dropbox
IJNLP 2008 dataset
The IJNLP dataset has the following NER tags.
O
LOCATION
PERSON
TIME
ORGANIZATION
NUMBER
DESIGNATION
Jahangir dataset
The Jahangir dataset has the following NER tags.
O
PERSON
LOCATION
ORGANIZATION
DATE
TIME
Datasets for Sentiment Analysis
IMDB Urdu Movie Review Dataset.
This dataset is taken from IMDB Urdu. It was translated using Google Translate. It has only two labels, i.e.
positive
negative
Roman Dataset
This dataset can be used for sentiment analysis for Roman Urdu. It has 3 classes for classification.
Neutral
Positive
Negative
If you need more information about this dataset, check out the link: Roman Urdu Dataset.
Products & Services dataset
This sentiment analysis dataset covers various products and services and is collected from sources such as social media and the web. It contains 3 classes.
pos
neg
neu
Daraz Products dataset
This dataset consists of reviews taken from Daraz. You can use it for sentiment analysis as well as spam or ham classification. It contains the following columns.
Product_ID
Date
Rating
Spam(1) and Not Spam(0)
Reviews
Sentiment
Features
The dataset is taken from Kaggle: daraz.
Urdu Dataset
Here is a small dataset for sentiment analysis. It has the following classification labels:
P
N
O
Link to the paper: Paper. GitHub link to the data: Urdu Corpus V1.
News Datasets
Urdu News Dataset 1M
This dataset (news/urdu-news-dataset-1M.tar.xz) is taken from Urdu News Dataset 1M. It has 4 classes and can be used for classification and other NLP tasks. I have removed unnecessary columns.
Business & Economics
Entertainment
Science & Technology
Sports
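The news archives are compressed tarballs (`.tar.xz` and `.tar.gz`), so they need to be unpacked before use. A minimal sketch with the standard `tarfile` module is shown here; `extract_archive` is a hypothetical helper, and the names of the files inside the archive are not documented above, so the function simply returns whatever the archive contains.

```python
import tarfile


def extract_archive(archive_path, target_dir):
    """Extract a .tar.xz or .tar.gz archive and return its member names."""
    # Mode "r:*" lets tarfile detect the compression (xz, gz, bz2) itself.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        # Only extract archives from sources you trust.
        archive.extractall(target_dir)
    return names
```

For example, `extract_archive("news/urdu-news-dataset-1M.tar.xz", "news/")` would unpack the dataset next to the archive and list the extracted files.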
Real-Fake News
This dataset (news/real_fake_news.tar.gz) is used for classifying real and fake news and comes from Fake News Dataset.
The dataset contains news from the following domains.
Technology
Education
Business
Sports
Politics
Entertainment
News Headlines
The headlines dataset (news/headlines.csv.tar.gz) is taken from Urdu News Headlines. The original dataset is in Excel format; I've converted it to csv for experiments. It can be used for clustering and classification.
RAW corpus and models
COUNTER (COrpus of Urdu News TExt Reuse) Dataset
This dataset is collected from journalism and can be used for Urdu NLP research. Here is the link to the resource for more information: COUNTER.
QA datasets
I have added two QA datasets in case someone wants to use them for a QA-based chatbot.
QA (Ahadis): qa_ahadis.csv contains QA pairs for Ahadis.
The dataset qa_gk.csv contains general knowledge QA pairs.
Urdu model for SpaCy
An Urdu model for SpaCy is now available. You can use it to build NLP apps easily. Install the package in your working environment.
pip install ur_model-0.0.0.tar.gz
You can use it with the following code.
import spacy
nlp = spacy.load("ur_model")
doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")
NLP Tutorials for Urdu
Check out my articles related to Urdu NLP tasks:
- POS Tagging: Urdu POS Tagging using MLP
- NER: How to build NER dataset for Urdu language?, Named Entity Recognition for Urdu
- Word 2 Vector: How to build Word 2 Vector for Urdu language
- Word and Sentence Similarity: Urdu Word and Sentence Similarity using SpaCy
- Tokenization: Urdu Tokenization using SpaCy
- Urdu Language Model: How to build Urdu language model in SpaCy
These articles are available on UrduNLP.
Some Helpful Tips
Download Single file from GitHub
If you only want the raw file (text or code), use the curl command with the file's raw URL (a /blob/ URL returns the HTML page instead of the file), i.e.
curl -LJO https://github.com/mirfan899/Urdu/raw/master/ner/uner.txt
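A URL copied from the GitHub file view contains `/blob/` and points to an HTML page, while the file contents are served from `raw.githubusercontent.com`. The sketch below converts one form into the other; `to_raw_url` is a hypothetical helper, not part of any GitHub tooling.

```python
from urllib.parse import urlsplit


def to_raw_url(blob_url):
    """Convert a github.com `blob` page URL into its raw-content URL."""
    parts = urlsplit(blob_url)
    # The path looks like /<user>/<repo>/blob/<branch>/<path to file>
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    user, repo, _, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])
```

The result can then be passed to `urllib.request.urlretrieve(raw_url, "uner.txt")` to download the file from Python instead of curl.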
Concatenate files
cd data
cat */*.txt > file_name.txt
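On systems without `cat` (for example Windows without a Unix shell), the same concatenation can be done in Python. This is a sketch rather than an exact replacement: `concatenate_txt` is a hypothetical helper, files are copied as raw bytes so their encoding is left untouched, and the sort order may differ slightly from the shell's glob order.

```python
from pathlib import Path


def concatenate_txt(root, output_path):
    """Concatenate every */*.txt file under root into output_path."""
    root = Path(root)
    output_path = Path(output_path)
    # Collect the sources first, in a deterministic order, and never
    # include the output file itself.
    sources = sorted(
        p for p in root.glob("*/*.txt") if p.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as out:
        for source in sources:
            out.write(source.read_bytes())
    return len(sources)
```

Calling `concatenate_txt("data", "data/file_name.txt")` mirrors the two shell commands above and returns the number of files that were merged.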
MK-PUCIT
Concatenate the MK-PUCIT files into a single file using:
cat */*.txt > file_name.txt
The original dataset has a bug: it contains both Others and Other, which are the same entity. If you want to use the dataset from the Dropbox link, use the following commands to clean it.
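import pandas as pd
# names must be an ordered list, not a set; the column order is assumed to be
# word, tag, matching the output format described below.
data = pd.read_csv('ner/mk-pucit.txt', sep='\t', names=["word", "tag"])
# Merge the duplicate label "Others" into "Other"
data["tag"] = data["tag"].replace({"Others": "Other"})
# Save as csv or txt as needed by changing the extension
data.to_csv("ner/mk-pucit.txt", index=False, header=False, sep='\t')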
Now the csv/txt file has the format
word tag
Note
If you have a dataset (link) and want to contribute, feel free to create a PR.