Glyph

This repository publishes all the code used in the following article:

Xiang Zhang, Yann LeCun, Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?, arXiv 1708.02657

The code and datasets have been fully released as of January 2018, including all the code for crawling, preprocessing, and training on the datasets. However, the documentation may not be complete yet. In the meantime, readers can refer to the doc directory for an example of reproducing all the results for the Dianping dataset, and extend it to the other datasets in a similar way.

Reproducibility Manifesto

If anyone sees a number in our paper, there is a script one can execute to reproduce it. No responsibility should be imposed on the user to figure out any experimental parameter buried in the paper's content.

Datasets

The data directory contains the preprocessing scripts for all the datasets used in the paper. The datasets themselves are released separately from their processing source code. See below for details.

Summary

The following table is a summary of the datasets. Most of them have millions of samples for training.

Dataset         Language      Classes  Train       Test
Dianping        Chinese       2        2,000,000   500,000
JD full         Chinese       5        3,000,000   250,000
JD binary       Chinese       2        4,000,000   360,000
Rakuten full    Japanese      5        4,000,000   500,000
Rakuten binary  Japanese      2        3,400,000   400,000
11st full       Korean        5        750,000     100,000
11st binary     Korean        2        4,000,000   400,000
Amazon full     English       5        3,000,000   650,000
Amazon binary   English       2        3,600,000   400,000
Ifeng           Chinese       5        800,000     50,000
Chinanews       Chinese       7        1,400,000   112,000
NYTimes         English       7        1,400,000   105,000
Joint full      Multilingual  5        10,750,000  1,500,000
Joint binary    Multilingual  2        15,000,000  1,560,000
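
As a quick sanity check on the table, the following Python sketch encodes the rows as data and counts how many datasets have at least one million training samples. It also sums the sizes of the JD, Rakuten, 11st and Amazon datasets; that these sums match the Joint rows is an observation from the numbers, not something this README states. The helper names (DATASETS, million_scale, totals) are made up for illustration and are not part of the repository.

```python
# Rows copied from the summary table: (dataset, language, classes, train, test).
DATASETS = [
    ("Dianping", "Chinese", 2, 2_000_000, 500_000),
    ("JD full", "Chinese", 5, 3_000_000, 250_000),
    ("JD binary", "Chinese", 2, 4_000_000, 360_000),
    ("Rakuten full", "Japanese", 5, 4_000_000, 500_000),
    ("Rakuten binary", "Japanese", 2, 3_400_000, 400_000),
    ("11st full", "Korean", 5, 750_000, 100_000),
    ("11st binary", "Korean", 2, 4_000_000, 400_000),
    ("Amazon full", "English", 5, 3_000_000, 650_000),
    ("Amazon binary", "English", 2, 3_600_000, 400_000),
    ("Ifeng", "Chinese", 5, 800_000, 50_000),
    ("Chinanews", "Chinese", 7, 1_400_000, 112_000),
    ("NYTimes", "English", 7, 1_400_000, 105_000),
    ("Joint full", "Multilingual", 5, 10_750_000, 1_500_000),
    ("Joint binary", "Multilingual", 2, 15_000_000, 1_560_000),
]


def million_scale(rows, threshold=1_000_000):
    """Return the names of datasets with at least `threshold` training samples."""
    return [name for name, _, _, train, _ in rows if train >= threshold]


def totals(rows, names):
    """Sum the train and test sizes of the datasets listed in `names`."""
    selected = [row for row in rows if row[0] in names]
    return sum(row[3] for row in selected), sum(row[4] for row in selected)


if __name__ == "__main__":
    big = million_scale(DATASETS)
    print(f"{len(big)} of {len(DATASETS)} datasets have >= 1M training samples")

    for suffix in ("full", "binary"):
        # Assumption: the four product-review datasets are the parts being summed.
        parts = [f"{site} {suffix}" for site in ("JD", "Rakuten", "11st", "Amazon")]
        print(f"sum of {suffix} parts:", totals(DATASETS, parts))
```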

Download

Datasets are released separately from the source code via Google Drive links. These datasets should be used for research purposes only.

Dataset         Train  Test
Dianping        Link   Link
JD full         Link   Link
JD binary       Link   Link
Rakuten full    Link   Link
Rakuten binary  Link   Link
11st full       Link   Link
11st binary     Link   Link
Amazon full     Link   Link
Amazon binary   Link   Link
Ifeng           Link   Link
Chinanews       Link   Link
NYTimes         Link   Link
Joint full      Link   Link
Joint binary    Link   Link

GNU Unifont

The glyphnet scripts require the GNU Unifont character images to run. The file unifont-8.0.01.t7b.xz can be downloaded via this link.
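
The file is distributed as an xz archive, so it presumably has to be decompressed before use. A minimal Python sketch using only the standard library is shown below; the function name decompress_xz is hypothetical, and where the glyphnet scripts expect the resulting .t7b file is not stated here, so the output location is an assumption to adjust as needed.

```python
import lzma
import shutil
from pathlib import Path
from typing import Optional


def decompress_xz(src: Path, dst: Optional[Path] = None) -> Path:
    """Decompress an .xz archive and return the path of the output file.

    If dst is not given, the output is written next to src with the
    trailing ".xz" removed (unifont-8.0.01.t7b.xz -> unifont-8.0.01.t7b).
    """
    src = Path(src)
    if dst is None:
        dst = src.with_suffix("")  # drops only the final ".xz" suffix
    # Stream the data so the whole file never has to fit in memory.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dst)


if __name__ == "__main__":
    # Assumption: the archive was downloaded into the current directory.
    archive = Path("unifont-8.0.01.t7b.xz")
    if archive.exists():
        print("Wrote", decompress_xz(archive))
    else:
        print("Archive not found:", archive)
```

On systems with the xz command-line tool installed, running xz -d on the archive achieves the same result.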