DaCiDian (大词典)

DaCiDian is an open-source lexicon for Chinese Automatic Speech Recognition (ASR).


Design

In a mainstream ASR system, the lexicon is a core component that maps words to acoustic modeling units (such as phones). In DaCiDian, we break this mapping into two independent layers:

word --> PinYin syllable --> phoneme

The purpose of this design is as follows:


Layer-1: Word -> Syllable Mapper (word_to_pinyin.txt)

examples:

...
裤子    KU_4 ZI_0
好事    HAO_4 SHI_4;HAO_3 SHI_4
教授    JIAO_1 SHOU_4;JIAO_4 SHOU_4
...
语音识别  YU_3 YIN_1 SHI_2 BIE_2
傅里叶变换 FU_4 LI_3 YE_4 BIAN_4 HUAN_4
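
The entries above could be read with a few lines of Python. This is only a sketch under assumptions inferred from the examples: the word is separated from its pronunciations by whitespace, alternative pronunciations are separated by `;`, and every syllable is written as `SYLLABLE_TONE`. The function name `parse_word_to_pinyin` is hypothetical and not part of DaCiDian.

```python
def parse_word_to_pinyin(lines):
    """Parse layer-1 lines into {word: [[(syllable, tone), ...], ...]}.

    Assumed format (inferred from the examples, not a formal spec):
    WORD <whitespace> SYL_TONE SYL_TONE[;SYL_TONE SYL_TONE ...]
    """
    lexicon = {}
    for raw in lines:
        parts = raw.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and "..." placeholders
        word, pronunciations = parts
        variants = []
        for pronunciation in pronunciations.split(";"):
            syllables = []
            for token in pronunciation.split():
                # "KU_4" -> ("KU", 4); tone 0 is the neutral tone
                base, _, tone = token.rpartition("_")
                syllables.append((base, int(tone)))
            variants.append(syllables)
        lexicon[word] = variants
    return lexicon


sample = [
    "裤子    KU_4 ZI_0",
    "好事    HAO_4 SHI_4;HAO_3 SHI_4",
]
entries = parse_word_to_pinyin(sample)
```

Keeping each alternative pronunciation as its own list makes it straightforward to emit one lexicon line per variant later on.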

Layer-2: Syllable -> Phone Mapper (pinyin_to_phone.txt)

pinyin_to_phone.txt is a user-defined mapping from PinYin syllables to the target phone set.

Taking the traditional PinYin Initial-Final structure as an example, the mapping would be defined as follows:

A	$0 a
AI	$0 ai
AN	$0 an
ANG	$0 ang
AO	$0 ao
BA	b a
BAI	b ai
BAN	b an
BANG	b ang
BAO	b ao
...
...
...
ZONG	z ong
ZOU	z ou
ZU	z u
ZUAN	z uan
ZUI	z ui
ZUN	z un
ZUO	z uo
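
To illustrate how the two layers compose, the following Python sketch expands a layer-1 pronunciation into phones using a table in the format shown above. The function names `parse_pinyin_to_phone` and `expand` are hypothetical. The way the tone is attached (appended to the last phone of each syllable as `_TONE`) is an assumption made for this example, since the tone handling is not specified here.

```python
def parse_pinyin_to_phone(lines):
    """Parse layer-2 lines such as "BA<TAB>b a" into {"BA": ["b", "a"]}."""
    table = {}
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank lines and "..." placeholders
        table[fields[0]] = fields[1:]
    return table


def expand(pronunciation, table):
    """Expand a pronunciation like "BAO_4 ZOU_3" into a flat phone list.

    Assumption: the tone digit is carried by the last phone of the syllable.
    """
    phones = []
    for token in pronunciation.split():
        base, _, tone = token.rpartition("_")
        syllable_phones = list(table[base])  # copy so the table is not mutated
        syllable_phones[-1] += "_" + tone
        phones.extend(syllable_phones)
    return phones


table = parse_pinyin_to_phone([
    "AI\t$0 ai",
    "BAO\tb ao",
    "...",
    "ZOU\tz ou",
])
phones = expand("BAO_4 ZOU_3", table)
```

Because the word-level file never mentions phones, swapping in a different pinyin_to_phone.txt changes the phone set without touching the word entries.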

Notes