Saudi Newspapers Corpus (SaudiNewsNet)

This repo contains a set of 31,030 Arabic newspaper articles, along with metadata, extracted from various online Saudi newspapers.

File Structure

The folder dataset contains a set of ZIP files, each named YYYY-MM-DD.zip and containing one JSON file with the corresponding name YYYY-MM-DD.json. The JSON files are stored in UTF-8 encoding.

Each JSON file contains an array of articles (the format of each article is explained in the next section), and its file name reflects the date on which the contained articles were extracted.
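
As a rough illustration of the layout described above, the following Python sketch reads the article array out of one archive and walks the whole dataset folder. The function names load_articles and iter_dataset are hypothetical helpers, not part of this repo. The sketch assumes only what is stated here: one UTF-8 JSON file per ZIP, named after the archive.

```python
import io
import json
import zipfile
from pathlib import Path


def load_articles(zip_path):
    """Return the array of article objects stored in one YYYY-MM-DD.zip file.

    Hypothetical helper: assumes the archive holds a JSON file with the
    same base name as the ZIP, encoded in UTF-8.
    """
    zip_path = Path(zip_path)
    json_name = zip_path.stem + ".json"  # 2015-07-21.zip -> 2015-07-21.json
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(json_name) as raw:
            # Decode the bytes as UTF-8 before handing them to the JSON parser.
            return json.load(io.TextIOWrapper(raw, encoding="utf-8"))


def iter_dataset(dataset_dir="dataset"):
    """Yield (extraction_date, articles) pairs for every ZIP in the folder."""
    for zip_path in sorted(Path(dataset_dir).glob("*.zip")):
        yield zip_path.stem, load_articles(zip_path)


if __name__ == "__main__":
    for date, articles in iter_dataset():
        print(date, len(articles))
```

Because the file names follow YYYY-MM-DD, sorting them alphabetically also orders them by extraction date.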

Article JSON Object Format

The JSON object for each article contains the following fields (some fields can have empty values in case the crawler failed to extract them):

Statistics

The dataset currently contains 31,030 Arabic articles (8,758,976 words in total). The articles were extracted from the following Saudi newspapers (sorted by number of articles):

Citing this Work

If you'd like to cite this work, you may use one of the following. You may also contact me (mazen [dot] abdulaziz [at] gmail [dot] com) so that I can include your research in the "referring work" section.

Contacting the Maintainer

If you'd like to cite this work, have comments or thoughts to share, or just feel like chatting, feel free to contact me on either:

Changelog

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The dataset is shared for the sole purpose of aiding open scientific research (in Arabic computing or linguistics), and can only be used for that purpose. The ownership of each article within the dataset belongs to the respective newspaper from which it was extracted, and the maintainer of the repository does not claim ownership of any of the content within it. If you think that this dataset breaches any established copyrights, please contact the repository maintainer.