prachathai-67k

News Article Corpus from Prachathai.com

The prachathai-67k dataset was scraped from the news site Prachathai. We filtered out articles with fewer than 500 characters of body text, which are mostly images and cartoons. The dataset contains 67,889 articles with 51,797 tags, published from August 24, 2004 to November 15, 2018. It was originally scraped by @lukkiddd and cleaned by @cstorm125. Download the dataset here. You can also see a preliminary exploration in exploration.ipynb.
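As a rough illustration of the length filter described above, here is a minimal sketch. The record layout (a dict with a `body_text` key) and the helper name `keep_long_articles` are assumptions made for this example, not the cleaning code actually used. The source also does not say how characters were counted; this sketch uses Python's `len`, which counts Unicode code points, so Thai vowel and tone marks each count as one character.

```python
from typing import Dict, Iterable, List

# Threshold stated in the dataset description.
MIN_BODY_CHARS = 500


def keep_long_articles(
    articles: Iterable[Dict[str, str]],
    min_chars: int = MIN_BODY_CHARS,
) -> List[Dict[str, str]]:
    """Drop articles whose body text has fewer than `min_chars` characters.

    `body_text` is a hypothetical field name chosen for this sketch.
    Articles with no body text at all are treated as empty and dropped.
    """
    return [a for a in articles if len(a.get("body_text", "")) >= min_chars]
```

An article with exactly 500 characters of body text is kept, since the description only excludes articles with fewer than 500.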

This dataset is part of the PyThaiNLP Thai text classification-benchmarks. For the benchmark, we selected the following high-volume tags, which resemble article types※:

We provide 4 benchmarks for 12-topic multi-label classification of prachathai-67k: fastText, LinearSVC, ULMFit, and Multilingual Universal Sentence Encoder (USE). In all cases, we first finetune the embeddings using all data. The data is then split into train, validation, and test sets at a 70/10/20 ratio. The benchmark numbers are based on the test set. Performance metrics are macro-averaged accuracy and F1 score. See classification.ipynb for more information.
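The 70/10/20 split can be sketched as follows. This is only an illustration: the source does not say whether the articles were shuffled, which seed was used, or how rounding was handled, so the shuffling, the `seed` parameter, and the function name `split_70_10_20` are all assumptions. classification.ipynb remains the reference for the actual split.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_10_20(
    items: Sequence[T], seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle `items` and split them into train/validation/test parts.

    Sizes use integer arithmetic to avoid floating-point rounding:
    70% train, 10% validation, and whatever remains (about 20%) as test.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumed: a seeded random shuffle

    n = len(shuffled)
    n_train = n * 70 // 100
    n_valid = n * 10 // 100

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Giving the remainder to the test set guarantees that every article lands in exactly one of the three parts, even when the total is not a multiple of 10.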

| model | macro-accuracy | macro-F1 |
|---|---|---|
| fastText | 0.9302 | 0.5529 |
| LinearSVC | 0.513277 | 0.552801 |
| ULMFit | 0.948737 | 0.744875 |
| USE | 0.856091 | 0.696172 |
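To make the two metrics concrete, here is a minimal sketch of one common way to macro-average accuracy and F1 in a multi-label setting: compute each metric separately for every label, then take the unweighted mean over labels. This reading of "macro-averaged" is an assumption, and the function names `per_label_counts` and `macro_accuracy_and_f1` are hypothetical; the exact computation behind the reported numbers is in classification.ipynb.

```python
from typing import List, Sequence, Tuple


def per_label_counts(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
    label: int,
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for one label column of 0/1 indicators."""
    tp = fp = fn = tn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        t, p = true_row[label], pred_row[label]
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def macro_accuracy_and_f1(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> Tuple[float, float]:
    """Unweighted mean over labels of per-label accuracy and per-label F1."""
    n_labels = len(y_true[0])
    accuracies: List[float] = []
    f1_scores: List[float] = []
    for j in range(n_labels):
        tp, fp, fn, tn = per_label_counts(y_true, y_pred, j)
        accuracies.append((tp + tn) / (tp + fp + fn + tn))
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when a label has no positives at all.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(accuracies) / n_labels, sum(f1_scores) / n_labels


if __name__ == "__main__":
    # Toy data: 4 articles, 3 labels (rows are articles, columns are labels).
    y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
    y_pred = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
    acc, f1 = macro_accuracy_and_f1(y_true, y_pred)
    print(f"macro-accuracy={acc:.3f} macro-F1={f1:.3f}")
```

On the toy data the sketch gives a macro-accuracy of about 0.917 but a macro-F1 of about 0.556. Under this definition, true negatives on rarely used labels raise accuracy without raising F1, which may be one reason the two columns can differ widely for the same model.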

※ Note that Prachathai.com is a left-leaning, human-rights-focused news site, hence the unusual news labels such as human rights and quality of life.