speech.ko

Korean read speech corpus (about 120 hours, 17 GB) from the National Institute of Korean Language (NIKL).

This repository cleans up the NIKL corpus: it discards unnecessary wav files, makes the sampling rate consistent, trims silence, and so on. Details are below.

Location

http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464

https://ithub.korean.go.kr/user/corpus/referenceManager.do

Once you have downloaded all corpus files from the links above, you will have the following zip files.

ls *.zip
3-1 #1 (20대 남성) 4-1.zip  3-2 #1 (30대 남성) 5-1.zip  3-2 #2 (40대 여성) 5-5.zip
3-1 #1 (20대 남성) 4-2.zip  3-2 #1 (30대 남성) 5-2.zip  3-3 #3 (50대 이상 남성여성) 6-1.zip
3-1 #1 (20대 남성) 4-3.zip  3-2 #1 (30대 남성) 5-3.zip  3-3 #3 (50대 이상 남성여성) 6-2.zip
3-1 #1 (20대 남성) 4-4.zip  3-2 #1 (30대 남성) 5-4.zip  3-3 #3 (50대 이상 남성여성) 6-3.zip
3-1 #2 (20대 여성) 5-1.zip  3-2 #1 (30대 남성) 5-5.zip  3-3 #3 (50대 이상 남성여성) 6-4.zip
3-1 #2 (20대 여성) 5-2.zip  3-2 #2 (40대 여성) 5-1.zip  3-3 #3 (50대 이상 남성여성) 6-5.zip
3-1 #2 (20대 여성) 5-3.zip  3-2 #2 (40대 여성) 5-2.zip  3-3 #3 (50대 이상 남성여성) 6-6.zip
3-1 #2 (20대 여성) 5-4.zip  3-2 #2 (40대 여성) 5-3.zip
3-1 #2 (20대 여성) 5-5.zip  3-2 #2 (40대 여성) 5-4.zip
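
The archive names above contain spaces, `#`, and parentheses, so they need careful quoting when extracted in a loop. The following is a minimal sketch, not part of this repository: `unzip_all` is a hypothetical helper, and the destination directory is an assumption, since the layout that run.sh expects is not described here.

```shell
# Extract every zip archive found in src_dir into dest_dir.
# Quoting "$zip" matters because the archive names contain spaces and parentheses.
unzip_all() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for zip in "$src_dir"/*.zip; do
        [ -e "$zip" ] || continue              # the glob matched nothing
        unzip -q "$zip" -d "$dest_dir" || return 1
    done
}

# Example with hypothetical paths:
# unzip_all "$HOME/Downloads" "$HOME/nikl_corpus"
```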

The transcription file contains 19 topics with the following numbers of sentences.

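| Topic | # of sentences |
|-------|----------------|
| 1     | 51             |
| 2     | 87             |
| 3     | 69             |
| 4     | 62             |
| 5     | 47             |
| 6     | 54             |
| 7     | 62             |
| 8     | 94             |
| 9     | 60             |
| 10    | 73             |
| 11    | 42             |
| 12    | 28             |
| 13    | 39             |
| 14    | 27             |
| 15    | 17             |
| 16    | 35             |
| 17    | 19             |
| 18    | 27             |
| 19    | 40             |
| Total | 930            |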

After unzipping, you will find the following speaker IDs.

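| Gender | Speaker ID   | Age           |
|--------|--------------|---------------|
| Female | fv01 to fv20 | 20s           |
| Female | fx01 to fx20 | 40s           |
| Female | fy01 to fy18 | 50s and older |
| Female | fz05 to fz06 | 50s and older |
| Male   | mv01 to mv20 | 20s           |
| Male   | mw01 to mw20 | 30s           |
| Male   | my01 to my11 | 50s and older |
| Male   | mz01 to mz09 | 50s and older |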
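
The two-letter prefix of a speaker ID encodes gender and age group, so the table can be turned into a small lookup. This is an illustrative sketch only: `speaker_info` is a hypothetical helper that is not part of this repository, and it checks only the prefix and the two-digit form, not whether the number falls inside the listed range.

```shell
# Print "<gender> <age group>" for a speaker ID such as fv01.
# The mapping follows the speaker table; unknown prefixes return status 1.
speaker_info() {
    case $1 in
        fv[0-9][0-9])              echo "Female 20s" ;;
        fx[0-9][0-9])              echo "Female 40s" ;;
        fy[0-9][0-9]|fz[0-9][0-9]) echo "Female 50s and older" ;;
        mv[0-9][0-9])              echo "Male 20s" ;;
        mw[0-9][0-9])              echo "Male 30s" ;;
        my[0-9][0-9]|mz[0-9][0-9]) echo "Male 50s and older" ;;
        *)                         echo "unknown"; return 1 ;;
    esac
}

# Example:
# speaker_info mw07
```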

Dependencies

sox

sox is used to inspect wav file properties, such as the sampling rate and whether the file format is valid, and to convert the sampling rate when necessary.
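
The check-then-convert step can be sketched roughly as follows. This is not the code in run.sh: `wav_rate` and `convert_rate` are hypothetical helper names, and the 16000 Hz target in the usage line is an assumption, since the rate the repository normalizes to is not stated here. `soxi -r` and `sox <in> -r <rate> <out>` are standard SoX invocations.

```shell
# Print the sampling rate (Hz) of a wav file, as reported by soxi.
wav_rate() {
    soxi -r "$1"
}

# Write dst at target_rate: resample with sox when the rate differs,
# otherwise copy the file unchanged.
convert_rate() {
    src=$1
    dst=$2
    target_rate=$3
    rate=$(wav_rate "$src") || return 1   # give up if soxi cannot read the file
    if [ "$rate" -ne "$target_rate" ]; then
        sox "$src" -r "$target_rate" "$dst"
    else
        cp "$src" "$dst"
    fi
}

# Example (16000 Hz is an assumed target, file names are hypothetical):
# convert_rate input.wav output.wav 16000
```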

auditok

auditok is used to trim unnecessary silence at the beginning and end of each recording.
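
Because both tools are invoked from the shell, it can help to confirm they are on the `PATH` before starting a long run. A minimal sketch follows; `check_deps` is a hypothetical helper and not part of this repository.

```shell
# Return 0 if every named command is available on PATH, 1 otherwise.
# Each missing command is reported on stderr.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing dependency: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_deps sox soxi auditok || echo "install the missing tools first"
```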

Command

git clone https://github.com/homink/speech.ko.git
cd speech.ko
./run.sh --corpus ${corpus_location}

The cleaned and trimmed wav files are written to ${corpus_location}/trimmed_data.
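
As a quick sanity check after the run, the output files can be counted. This sketch assumes the files carry a lowercase `.wav` extension and may sit in subdirectories; `count_wavs` is a hypothetical helper, not something the repository provides.

```shell
# Count the .wav files anywhere under the given directory.
count_wavs() {
    find "$1" -type f -name '*.wav' | wc -l | tr -d ' '
}

# Example:
# count_wavs "${corpus_location}/trimmed_data"
```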