Thai BEST2010 Corpus Cooker

best2010_cooker

A tool for extracting segmented words from the word-segmented Thai BEST2010 corpus.

Data formats
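
This page does not spell the formats out: per the options below, input is plain text ('txt') and output is either 'sl' or 'wl'. As an assumption (a common convention, not confirmed here), 'sl' means one sentence per line with words separated by spaces, and 'wl' means one word per line with a blank line between sentences. Under that assumption, converting between the two takes only a few lines of Python:

def sl_to_wl(sl_path, wl_path):
    """Convert 'sl' (one space-separated sentence per line) to 'wl'
    (one word per line, blank line after each sentence).

    Assumes the naming convention described above; this helper is
    illustrative and not part of cooker.py.
    """
    with open(sl_path, encoding="utf-8") as fin, \
         open(wl_path, "w", encoding="utf-8") as fout:
        for line in fin:
            for word in line.split():
                fout.write(word + "\n")
            fout.write("\n")  # blank line marks the sentence boundary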

Usage

usage: cooker.py [-h] [--quiet] --input_data INPUT_DATA
                 [--output_data OUTPUT_DATA]
                 [--input_data_format INPUT_DATA_FORMAT]
                 [--output_data_format OUTPUT_DATA_FORMAT]
                 [--sentence_len_threshold SENTENCE_LEN_THRESHOLD] [--denoise]
                 [--gen_common_vocab] [--gen_ne_vocab] [--gen_ab_vocab]
                 [--exclude_empty_line] [--gen_div_data]

optional arguments:
  -h, --help            show this help message and exit
  --quiet, -q           Do not report on screen
  --input_data INPUT_DATA, -i INPUT_DATA
                        File path to input data
  --output_data OUTPUT_DATA, -o OUTPUT_DATA
                        File path to output data
  --input_data_format INPUT_DATA_FORMAT, -f INPUT_DATA_FORMAT
                        Choose format of input data from 'txt' (Default: txt)
  --output_data_format OUTPUT_DATA_FORMAT
                        Choose format of output data from 'wl' or 'sl'
                        (Default: sl)
  --sentence_len_threshold SENTENCE_LEN_THRESHOLD
                        Sentence length threshold. Sentences whose length is
                        lower than the threshold are ignored (Default: 1)
  --denoise             Specify to deeply clean the input data (original data
                        denoising); see README.md for more details
  --gen_common_vocab    Specify to generate common vocabulary
  --gen_ne_vocab        Specify to generate NE vocabulary
  --gen_ab_vocab        Specify to generate abbreviation (AB) vocabulary
  --exclude_empty_line  Specify to exclude empty line
  --gen_div_data        Specify to generate division data (train, valid, test);
                        see README.md for more details
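
For reference, the run shown under Example outputs below corresponds to an invocation along these lines (paths taken from that run; --input_data_format, --sentence_len_threshold, --quiet, and --exclude_empty_line are left at their defaults):

python cooker.py \
    --input_data data/best2010-sample.txt \
    --output_data cooked \
    --output_data_format sl \
    --denoise \
    --gen_common_vocab --gen_ne_vocab --gen_ab_vocab \
    --gen_div_data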

Example outputs

Start time: 20210609_1919

### arguments
# quiet=False
# input_data=data/best2010-sample.txt
# output_data=cooked
# input_data_format=txt
# output_data_format=sl
# sentence_len_threshold=1
# denoise=True
# gen_common_vocab=True
# gen_ne_vocab=True
# gen_ab_vocab=True
# exclude_empty_line=False
# gen_div_data=True

save cooked data: cooked/20210609_1919/cooked_best2010_20210609_1919.sl
save cooked train data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.sl
save cooked valid data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.sl
save cooked test data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.sl
save cooked train common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.vocab
save cooked valid common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.vocab
save cooked test common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.vocab
save cooked train NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.ne.vocab
save cooked valid NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.ne.vocab
save cooked test NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.ne.vocab
save cooked train AB data: cooked/20210609_1919/cooked_best2010_20210609_1919.train.ab.vocab
save cooked valid AB data: cooked/20210609_1919/cooked_best2010_20210609_1919.valid.ab.vocab
save cooked test AB data: cooked/20210609_1919/cooked_best2010_20210609_1919.test.ab.vocab
save cooked common vocab data: cooked/20210609_1919/cooked_best2010_20210609_1919.vocab
save cooked NE data: cooked/20210609_1919/cooked_best2010_20210609_1919.ne.vocab
save cooked abbreviation data: cooked/20210609_1919/cooked_best2010_20210609_1919.ab.vocab
### report
# [PRE] line: 50 ...
# [PRE] sent: 50 ...
# [POST] sent: 50 ...
# [POST] train-div: 44 ...
# [POST] valid-div: 4 ...
# [POST] test-div: 2 ...
# [POST] common vocab: 642 ...
# [POST] ne: 7 ...
# [POST] abbreviation: 1 ...
# [POST] train-vocab: 586 ...
# [POST] valid-vocab: 191 ...
# [POST] test-vocab: 89 ...
# [POST] train-ne: 7 ...
# [POST] train-ab: 1 ...
# [POST] word: 4020 ...
# [POST] char: 15977 ...
# [POST] words/sent: min=2 max=177 avg=80.4
# [POST] chars/sent: min=6 max=708 avg=319.54
# [POST] chars/word: min=1 max=20 avg=3.9743781094527364
Elapsed time: 0.109500 sec.
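
The per-sentence and per-word averages in the report follow directly from the totals: 4020 words / 50 sentences = 80.4 words/sent, 15977 chars / 50 sentences = 319.54 chars/sent, and 15977 chars / 4020 words ≈ 3.9744 chars/word.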

Data divisions (shuffled)

The train, validation, and test divisions are each merged from the article, encyclopedia, news, and novel domains. See src/divs.py for the division indices, which refer to the original data.
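
As a rough illustration of the split sizes above (44/4/2 out of 50 sentences in the sample run, i.e. 88%/8%/4%), a shuffled division could be produced as follows. This is a minimal sketch, not the tool's actual code; cooker.py uses the fixed indices in src/divs.py rather than a fresh shuffle.

import random

def split_divisions(sentences, ratios=(0.88, 0.08), seed=0):
    """Shuffle sentences and split them into train/valid/test portions.

    Illustrative only: the real tool reads fixed indices from src/divs.py.
    """
    idx = list(range(len(sentences)))
    random.Random(seed).shuffle(idx)
    n_train = round(len(idx) * ratios[0])
    n_valid = round(len(idx) * ratios[1])
    train = [sentences[i] for i in idx[:n_train]]
    valid = [sentences[i] for i in idx[n_train:n_train + n_valid]]
    test = [sentences[i] for i in idx[n_train + n_valid:]]
    return train, valid, test

# 50 sentences -> 44 train, 4 valid, 2 test, matching the report above.
print([len(d) for d in split_divisions(["s%d" % i for i in range(50)])])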

Denoise (noisy data found in the original data)

13 cases (34 lines) were found to be noisy data (incorrect annotations). Each number refers to a line number in the original data.
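
Since the noisy cases are identified by line number, filtering them out of the original data is straightforward. A minimal sketch, assuming 1-based line numbers (pass in the actual numbers listed in README.md):

def drop_noisy_lines(in_path, out_path, noisy_line_numbers):
    """Copy in_path to out_path, skipping the given 1-based line numbers.

    noisy_line_numbers should hold the line numbers listed in README.md.
    """
    noisy = set(noisy_line_numbers)
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            if lineno not in noisy:
                fout.write(line)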