Home

Awesome

Dataset is arranged as authors-> [en , ur, hi] -> ghazals/poems

[en, ur , hi] signify english translieration and urdu , hindi text

Why is this interesting ? Urdu is a low resource language in NLP. Compared to English, which could have hundreds of thousands of articles floating around on the internet, there is not much content for Urdu, to train ML language models .

Ghazal is a form of poetry popular in South Asia.

In terms of NLP, it provides interesting possiblities for future testing of language models.

Source: https://en.wikipedia.org/wiki/Ghazal

I want to highlight an important point at this momement. 4Mb of text data is nothing compared to what transformer based models actually need.

common crawl dataset is a giant repository of free text data in more than 40 languages. If you actually want to train a transformer model from scratch, you would need data in order of millions of text files. And for that it would be best to start with one of these big data tools.

===============================================

All data credits belong to the wonderful work done by Rekhta foundation. Link: https://www.rekhta.org/

Data has been parsed into Urdu, Hindi and English translieration thanks to their excellent webpage. Consider supporting them for their great work in pushing the urdu language.

Credits to these authors for their wonderful original creations:

'mirza-ghalib','allama-iqbal','faiz-ahmad-faiz','sahir-ludhianvi','meer-taqi-meer', 'dagh-dehlvi','kaifi-azmi','gulzar','bahadur-shah-zafar','parveen-shakir', 'jaan-nisar-akhtar','javed-akhtar','jigar-moradabadi','jaun-eliya', 'ahmad-faraz','meer-anees','mohsin-naqvi','firaq-gorakhpuri','fahmida-riaz','wali-mohammad-wali', 'waseem-barelvi','akbar-allahabadi','altaf-hussain-hali','ameer-khusrau','naji-shakir','naseer-turabi', 'nazm-tabatabai','nida-fazli','noon-meem-rashid', 'habib-jalib'

===============================================

If you would want to extend the size of this dataset, do a fork of this repository. There is scope of improvement because currently this simple parsing only looks at a hand curated list of authors. There can be better ways of automating the task.