prachathai-67k
News Article Corpus from Prachathai.com
The prachathai-67k dataset was scraped from the news site Prachathai. We filtered out articles with fewer than 500 characters of body text, which were mostly images and cartoons. It contains 67,889 articles with 51,797 tags, published from August 24, 2004 to November 15, 2018. The dataset was originally scraped by @lukkiddd and cleaned by @cstorm125. Download the dataset here. You can also see a preliminary exploration in exploration.ipynb.
This dataset is part of the PyThaiNLP Thai text classification-benchmarks. For the benchmark, we selected the following high-volume tags, which resemble article types※:
- การเมือง - politics
- สิทธิมนุษยชน - human rights
- คุณภาพชีวิต - quality of life
- ต่างประเทศ - international
- สังคม - social
- สิ่งแวดล้อม - environment
- เศรษฐกิจ - economics
- วัฒนธรรม - culture
- แรงงาน - labor
- ความมั่นคง - national security
- ไอซีที - ICT
- การศึกษา - education
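Because an article can carry several of these tags at once, a multi-label setup usually represents each article as a 12-slot 0/1 vector. The following is a minimal sketch of that encoding, not the code used in the benchmark notebook; the names `TOPICS` and `encode_tags` are hypothetical, and the assumption is that each article's tags are available as a list of Thai tag strings.

```python
# The 12 benchmark tags, in the order listed above.
TOPICS = [
    "การเมือง",      # politics
    "สิทธิมนุษยชน",  # human rights
    "คุณภาพชีวิต",   # quality of life
    "ต่างประเทศ",    # international
    "สังคม",         # social
    "สิ่งแวดล้อม",   # environment
    "เศรษฐกิจ",      # economics
    "วัฒนธรรม",      # culture
    "แรงงาน",        # labor
    "ความมั่นคง",    # national security
    "ไอซีที",        # ICT
    "การศึกษา",      # education
]


def encode_tags(article_tags):
    """Return a 12-element 0/1 vector, one slot per benchmark topic.

    Tags outside the 12 benchmark topics are ignored.
    """
    present = set(article_tags)
    return [1 if topic in present else 0 for topic in TOPICS]


# An article tagged politics and labor, plus one tag outside the benchmark.
example = encode_tags(["การเมือง", "แรงงาน", "some-other-tag"])
print(example)  # [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```

An article with none of the 12 tags encodes to an all-zero vector under this scheme; whether the benchmark keeps or drops such articles is not stated here.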
We provide 4 benchmarks for 12-topic multi-label classification of prachathai-67k: fastText, LinearSVC, ULMFit, and Multilingual Universal Sentence Encoder (USE). In all cases, we first fine-tune the embeddings using all data. The data is then split into train, validation, and test sets in a 70/10/20 ratio. The benchmark numbers are based on the test set. Performance metrics are macro-averaged accuracy and F1 score. See classification.ipynb for more information.
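As an illustration of the 70/10/20 split, here is a minimal sketch using only the standard library. It assumes a simple seeded random shuffle; the actual procedure and seed used in classification.ipynb are not stated here and may differ. The function name `split_70_10_20` is hypothetical.

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle a copy of `items` and split it into train/validation/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 70 // 100
    n_valid = len(shuffled) * 10 // 100

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, roughly 20%
    return train, valid, test


train, valid, test = split_70_10_20(range(100))
print(len(train), len(valid), len(test))  # 70 10 20
```

Fixing the seed keeps the split reproducible, which matters when comparing several models on the same test set.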
model | macro-accuracy | macro-F1 |
---|---|---|
fastText | 0.9302 | 0.5529 |
LinearSVC | 0.513277 | 0.552801 |
ULMFit | 0.948737 | 0.744875 |
USE | 0.856091 | 0.696172 |
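To make the two metric columns concrete, the sketch below computes accuracy and F1 separately for each label and then averages them across labels. This is one common reading of "macro-averaged" and is an assumption; the benchmark notebook is the authority on how the numbers were actually computed. The function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (macro-accuracy, macro-F1) for multi-label 0/1 vectors.

    y_true and y_pred are lists of equal-length 0/1 lists, one per article.
    A label with no true and no predicted positives gets F1 = 0.0 here.
    """
    n_labels = len(y_true[0])
    accuracies, f1_scores = [], []
    for j in range(n_labels):
        truth = [row[j] for row in y_true]
        pred = [row[j] for row in y_pred]
        tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
        correct = sum(1 for t, p in zip(truth, pred) if t == p)

        accuracies.append(correct / len(truth))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(accuracies) / n_labels, sum(f1_scores) / n_labels


# Toy data: 4 articles, 2 labels; the second label never occurs.
y_true = [[1, 0], [0, 0], [1, 0], [0, 0]]
y_pred = [[1, 0], [0, 0], [0, 0], [0, 0]]
macro_acc, macro_f1 = macro_scores(y_true, y_pred)
```

In this toy case the macro-accuracy is 0.875 while the macro-F1 is about 0.33: a label that is rarely or never positive scores high accuracy from the many correct zeros but contributes little or nothing to F1. That effect may be part of why the two columns in the table can diverge.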
※ Note that Prachathai.com is a left-leaning, human-rights-focused news site, which is why it uses unusual news labels such as human rights and quality of life.