Awesome

Maḵẖzan مخزن

An Urdu text corpus to enable research and applications for the Urdu language. We believe Maḵẖzan is the best Urdu dataset to start work for Urdu NLP.

This dataset currently comprises 6.26 million words of Urdu text. We have selected source text that we believe to have gone through strong editorial standards, to preserve linguistic integrity. The text is then syntactically marked up, so that headings, paragraphs, and lists can be identified. Metadata is added to each file so data can be intelligently filtered and selected. We annotate non-Urdu text included in source publications. Data also goes through an intense cleaning process to make the text easier to read for software, as well as correcting typograghical errors.

Maḵẖzan is free to use for all commercial and non-commercial purposes. To protect writers' whose writing is part of this dataset, we ask that you not republish the raw text (please see License information below). If you end up using Maḵẖzan for your work, we'd love to hear about it.

Navigating this repository

/docs: Documentation
/scripts: Scripts to analyze the text, constructed as a Swift package.
/stats: Output of text analyses, which can be used out of the box to power NLP applications, such as word and n-gram frequencies.
/text: The text corpus itself, consisting of XML files.
/tools: Command line tools for data cleaning. These are of use to help improve the quality of Maḵẖzan

Contribution

We would love your help in a number of ways. Please get in touch if you:

have any text to contribute to this repository
spot an error in the text
write a script that should be distributed with the corpus

Start an issue on this repository or get in touch through our website.

Contributors

We are grateful to our current contributors

Engineering

Waqas Ali
Hassan Talat
Shaoor Munir
Muhammad Haroon

Data Editing

Hamza Safdar
Akbar Zaman

Text Contributions

Gurmani Center at LUMS
IBC Urdu
Al-Mawrid Institute

Citation

@misc{makhzan,
title={Maḵẖzan},
howpublished = "\url{https://github.com/zeerakahmed/makhzan/}",
}

License / Copyright

Material in the `/text` directory

All files in the /text directory are covered under standard copyright. Each piece of text has been included in this repository with explicity permission of respective copyright holders, who are identified in the <meta> tag for each file. You are free to use this text for analysis, research and development, but you are not allowed to redistribute or republish this text. Where possible we encourage that forks of this repository be kept private unless explicit permission is granted.

Some cases where a less restrictive license could apply to files in the /text directory are presented below.

In some cases copyright free text has been digitally reproduced through the hard work of our collaborators. In such cases we have credited the appropriate people where possible in a <notes> field in the file's metadata, and we strongly encourage you to contact them before redistributing this text in any form.

Where a separate license is provided along with the text, we have provided corresponding data in the <publication> field in a file's metadata.

All other materials

All other materials in this repository (such as software, aggregated analyses and documentation) in the /scripts or /stats directory are licensed under the terms of the MIT license.

Copyright concerns

If you feel any material has been included in this repository erroneously and/or copyright arrangements have not been respected, please file an issue on this repository or get in touch through our website.