Glyph

This repository publishes all the code used for the following article:

Xiang Zhang, Yann LeCun, Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?, arXiv 1708.02657

The code and datasets were fully released as of January 2018, including all the code for crawling, preprocessing, and training on the datasets. The documentation may not be complete yet. In the meantime, readers can refer to the doc directory for an example of reproducing all the results for the Dianping dataset, and extend it to the other datasets in a similar way.

Reproducibility Manifesto

If anyone sees a number in our paper, there is a script one can execute to reproduce it. No responsibility should be imposed on the user to figure out any experimental parameter buried in the paper's content.

Datasets

The data directory contains the preprocessing scripts for all the datasets used in the paper. The datasets themselves are released separately from their processing source code. See below for details.

Summary

The following table is a summary of the datasets. Most of them have millions of samples for training.

Dataset	Language	Classes	Train	Test
Dianping	Chinese	2	2,000,000	500,000
JD full	Chinese	5	3,000,000	250,000
JD binary	Chinese	2	4,000,000	360,000
Rakuten full	Japanese	5	4,000,000	500,000
Rakuten binary	Japanese	2	3,400,000	400,000
11st full	Korean	5	750,000	100,000
11st binary	Korean	2	4,000,000	400,000
Amazon full	English	5	3,000,000	650,000
Amazon binary	English	2	3,600,000	400,000
Ifeng	Chinese	5	800,000	50,000
Chinanews	Chinese	7	1,400,000	112,000
NYTimes	English	7	1,400,000	105,000
Joint full	Multilingual	5	10,750,000	1,500,000
Joint binary	Multilingual	2	15,000,000	1,560,000
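
As a quick sanity check on the table above, the Joint rows equal the sums of the corresponding JD, Rakuten, 11st and Amazon rows. This is consistent with the joint datasets being combinations of those four, although this README does not say so explicitly. The Python sketch below re-enters the numbers by hand from the table; the names SIZES and joint_totals are illustrative and not part of this repository.

```python
# Train/test sizes copied by hand from the summary table above.
# SIZES and joint_totals are hypothetical names used only for this check.
SIZES = {
    "JD full": (3_000_000, 250_000),
    "JD binary": (4_000_000, 360_000),
    "Rakuten full": (4_000_000, 500_000),
    "Rakuten binary": (3_400_000, 400_000),
    "11st full": (750_000, 100_000),
    "11st binary": (4_000_000, 400_000),
    "Amazon full": (3_000_000, 650_000),
    "Amazon binary": (3_600_000, 400_000),
    "Joint full": (10_750_000, 1_500_000),
    "Joint binary": (15_000_000, 1_560_000),
}


def joint_totals(variant):
    """Sum train and test sizes over the four review datasets for one variant."""
    sources = ["JD", "Rakuten", "11st", "Amazon"]
    train = sum(SIZES[f"{name} {variant}"][0] for name in sources)
    test = sum(SIZES[f"{name} {variant}"][1] for name in sources)
    return train, test


if __name__ == "__main__":
    for variant in ("full", "binary"):
        # Print the computed sum next to the value listed in the table.
        print(variant, joint_totals(variant), SIZES[f"Joint {variant}"])
```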

Download

Datasets are released separately from the source code via Google Drive links. These datasets should be used for research purposes only.

Dataset	Train	Test
Dianping	Link	Link
JD full	Link	Link
JD binary	Link	Link
Rakuten full	Link	Link
Rakuten binary	Link	Link
11st full	Link	Link
11st binary	Link	Link
Amazon full	Link	Link
Amazon binary	Link	Link
Ifeng	Link	Link
Chinanews	Link	Link
NYTimes	Link	Link
Joint full	Link	Link
Joint binary	Link	Link

GNU Unifont

The glyphnet scripts require the GNU Unifont character images to run. The file unifont-8.0.01.t7b.xz can be downloaded via this link.
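
The downloaded file is xz-compressed. Whether the glyphnet scripts expect the decompressed .t7b file, and where they look for it, is not stated here, so treat the following as a minimal sketch that assumes the file must be unpacked first. It uses only the Python standard library; the helper name decompress_xz is hypothetical and not part of this repository.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None):
    """Decompress an .xz file and return the path of the output file.

    If dst is not given, the output is written next to src with the
    trailing ".xz" removed (an assumption about the desired file name).
    """
    src = Path(src)
    if dst is None:
        dst = src.with_suffix("")  # e.g. unifont-8.0.01.t7b.xz -> unifont-8.0.01.t7b
    dst = Path(dst)
    # Stream the data in chunks so the whole file is never held in memory.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, calling decompress_xz("unifont-8.0.01.t7b.xz") in the directory holding the download writes unifont-8.0.01.t7b alongside it. The command-line tool xz, where installed, does the same job.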