Thai Text Classification Benchmarks

We provide 4 datasets for Thai text classification, covering different styles, objectives, and numbers of labels. We also provide preliminary benchmarks using fastText, linear models (LinearSVC and logistic regression), and thai2fit's implementation of ULMFit.
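
As a taste of the baselines, here is a hedged sketch of how a fastText classifier might be trained; the file name, hyperparameters, and example sentence are placeholders rather than the exact settings used in the benchmark notebooks.

```python
# Hypothetical sketch of a fastText baseline like the ones benchmarked here.
# Assumes training data in fastText's supervised format, one example per line:
#   __label__<label> <space-tokenized Thai text>
# The file name and hyperparameters are placeholders, not the notebook settings.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",  # placeholder path
    epoch=5,
    lr=0.1,
    wordNgrams=2,       # include word bigram features
)

# predict the top label and its probability for a pre-tokenized sentence
labels, probs = model.predict("อาหาร อร่อย มาก")
print(labels, probs)
```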

prachathai-67k, truevoice-intent, and all code in this repository are released under the Apache License 2.0 by PyThaiNLP. wisesight-sentiment is released to the public domain under the Creative Commons Zero v1.0 Universal license by Wisesight. wongnai-corpus is released under the GNU Lesser General Public License v3.0 by Wongnai.

Dataset Description

| Datasets | Style | Objective | Labels | Size |
|----------|-------|-----------|--------|------|
| prachathai-67k: body_text | Formal (online newspapers), News | Topic | 12 | 67k |
| truevoice-intent: destination | Informal (call center transcription), Customer service | Intent | 7 | 16k |
| wisesight-sentiment | Informal (social media), Conversation/opinion | Sentiment | 4 | 28k |
| wongnai-corpus | Informal (review site), Restaurant review | Sentiment | 5 | 40k |

prachathai-67k: body_text

We benchmark prachathai-67k using body_text as the text feature in a 12-label multi-label classification task. Performance is measured by macro-averaged accuracy and F1 score. The code can be run to confirm performance at this notebook, which also provides per-class performance metrics.
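
For concreteness, here is a minimal sketch of this multi-label setup, assuming macro-accuracy means per-label accuracy averaged over labels; the toy data, two-label matrix, and plain TfidfVectorizer are stand-ins for the real corpus, the 12 topic labels, and Thai tokenization.

```python
# Toy sketch of the multi-label setup; the real benchmark uses the full
# prachathai-67k splits, 12 topic columns, and Thai tokenization.
# Macro-accuracy is interpreted here as per-label accuracy averaged over labels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

texts_train = ["การเมือง ประท้วง", "สิ่งแวดล้อม ป่าไม้", "การเมือง เลือกตั้ง"]
y_train = np.array([[1, 0], [0, 1], [1, 0]])  # one binary column per topic
texts_test = ["ประท้วง เลือกตั้ง"]
y_test = np.array([[1, 0]])

vec = TfidfVectorizer()  # swap in a Thai tokenizer for real use
X_train, X_test = vec.fit_transform(texts_train), vec.transform(texts_test)

clf = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)
preds = clf.predict(X_test)
if hasattr(preds, "toarray"):  # predict may return a sparse indicator matrix
    preds = preds.toarray()

print("macro-F1:", f1_score(y_test, preds, average="macro", zero_division=0))
print("macro-accuracy:", (preds == y_test).mean(axis=0).mean())
```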

| model | macro-accuracy | macro-F1 |
|-------|----------------|----------|
| fastText | 0.9302 | 0.5529 |
| LinearSVC | 0.513277 | 0.552801 |
| ULMFit | 0.948737 | 0.744875 |
| USE | 0.856091 | 0.696172 |

truevoice-intent: destination

We benchmark truevoice-intent using destination as the target in a 7-class multi-class classification task. Performance is measured by micro-averaged and macro-averaged accuracy and F1 score. The code can be run to confirm performance at this notebook, which also provides per-class performance metrics.
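
The gap between micro- and macro-averaged scores in the table below comes from class imbalance; a toy illustration with invented class names:

```python
# Toy illustration of micro vs macro averaging on an imbalanced intent task;
# class names are invented, not the actual destination labels.
from sklearn.metrics import f1_score

y_true = ["billing", "billing", "billing", "internet", "promotion"]
y_pred = ["billing", "billing", "internet", "internet", "billing"]

# micro pools every decision, so frequent classes dominate (here: 0.6);
# macro averages per-class F1, so rare classes count equally (here: ~0.44)
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```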

| model | macro-accuracy | micro-accuracy | macro-F1 | micro-F1 |
|-------|----------------|----------------|----------|----------|
| LinearSVC | 0.957806 | 0.95747712 | 0.869411 | 0.85116993 |
| ULMFit | 0.955066 | 0.84273111 | 0.852149 | 0.84273111 |
| BERT | 0.8921 | 0.85 | 0.87 | 0.85 |
| USE | 0.943559 | 0.94355855 | 0.787686 | 0.802455 |

wisesight-sentiment

Performance of wisesight-sentiment is based on the test set of WISESIGHT Sentiment Analysis. The code can be run to confirm performance at this notebook.

Disclaimer: The labels were obtained manually and are prone to errors. If you plan to apply the models in this benchmark to real-world applications, be sure to evaluate them on your own dataset first.
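
One low-effort way to follow this advice is to score a candidate model on a small labeled sample of your own data; a sketch with placeholder data and a stand-in predictor:

```python
# Minimal sanity check: score a candidate model on a labeled sample of your
# own data before trusting benchmark numbers. Texts, labels, and the stand-in
# predictor below are placeholders.
from sklearn.metrics import accuracy_score, f1_score

texts = ["บริการดีมาก", "ช้ามาก ไม่ประทับใจ"]  # your own texts
gold = ["pos", "neg"]                          # your own labels

def predict(text):
    return "pos"  # stand-in; call a real benchmark model here instead

preds = [predict(t) for t in texts]
print("accuracy:", accuracy_score(gold, preds))
print("macro-F1:", f1_score(gold, preds, average="macro", zero_division=0))
```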

| Model | Public Accuracy | Private Accuracy |
|-------|-----------------|------------------|
| Logistic Regression | 0.72781 | 0.7499 |
| FastText | 0.63144 | 0.6131 |
| ULMFit | 0.71259 | 0.74194 |
| ULMFit Semi-supervised | 0.73119 | 0.75859 |
| ULMFit Semi-supervised Repeated One Time | 0.73372 | 0.75968 |
| USE | 0.63987 | * |

wongnai-corpus

Performance of wongnai-corpus is based on the test set of the Wongnai Challenge: Review Rating Prediction. The code can be run to confirm performance at this notebook.
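
Since each review receives exactly one predicted rating, the micro-averaged F1 reported below is mathematically the same as accuracy; a quick toy check:

```python
# For single-label rating prediction (one predicted class per review),
# micro-averaged F1 equals plain accuracy; toy ratings below.
from sklearn.metrics import accuracy_score, f1_score

y_true = [5, 4, 3, 5, 1]
y_pred = [5, 4, 4, 5, 1]

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("accuracy:", accuracy_score(y_true, y_pred))  # identical value
```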

| Model | Public Micro-F1 | Private Micro-F1 |
|-------|-----------------|------------------|
| ULMFit Knight | 0.61109 | 0.62580 |
| ULMFit | 0.59313 | 0.60322 |
| fastText | 0.5145 | 0.5109 |
| LinearSVC | 0.5022 | 0.4976 |
| Kaggle Score | 0.59139 | 0.58139 |
| BERT | 0.56612 | 0.57057 |
| USE | 0.42688 | 0.41031 |

BibTeX

@software{cstorm125_2020_3852912,
  author       = {cstorm125 and
                  lukkiddd},
  title        = {PyThaiNLP/classification-benchmarks: v0.1-alpha},
  month        = may,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.1-alpha},
  doi          = {10.5281/zenodo.3852912},
  url          = {https://doi.org/10.5281/zenodo.3852912}
}

Acknowledgements