Chinese Text Normalization for Speech Processing

Problem

Search for "Text Normalization" (TN) on Google or GitHub and you will hardly find any open-source project that is "ready-to-use" for text normalization tasks. Instead, you find a bunch of NLP toolkits and frameworks that support TN functionality. There is quite a bit of work between "supporting text normalization" and "doing text normalization".

Reason

Goal

This project provides a ready-to-use TN module for Chinese. Since my background is speech processing, it is designed to handle the most common TN tasks in Chinese ASR text processing pipelines.

Normalizers

  1. supported NSW (Non-Standard-Word) Normalization

    NSW type | raw | normalized
    --- | --- | ---
    cardinal | 这块黄金重达324.75克 | 这块黄金重达三百二十四点七五克
    date | 她出生于86年8月18日,她弟弟出生于1995年3月1日 | 她出生于八六年八月十八日 她弟弟出生于一九九五年三月一日
    digit | 电影中梁朝伟扮演的陈永仁的编号27149 | 电影中梁朝伟扮演的陈永仁的编号二七一四九
    fraction | 现场有7/12的观众投出了赞成票 | 现场有十二分之七的观众投出了赞成票
    money | 随便来几个价格12块5,34.5元,20.1万 | 随便来几个价格十二块五 三十四点五元 二十点一万
    percentage | 明天有62%的概率降雨 | 明天有百分之六十二的概率降雨
    telephone | 这是固话0421-33441122 | 这是固话零四二一三三四四一一二二
    telephone | 这是手机+86 18544139121 | 这是手机八六一八五四四一三九一二一

    Acknowledgement: the NSW normalization code is based on Zhiyang Zhou's work.

  2. punctuation removal

    For Chinese, it removes the punctuation marks collected in the Zhon project, which consist of

    • non-stop puncs
      '"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
      
    • stop puncs
      '!?。。'
      

    For English, it removes Python's string.punctuation

  3. multilingual English word upper/lower case conversion: since ASR/TTS lexicons usually unify English entries to either uppercase or lowercase, the TN module should adapt to the lexicon accordingly.
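The three normalizers above can be sketched as a small pipeline. This is a minimal illustration, not the project's code: all function names (`read_cardinal`, `normalize_nsw`, `remove_punctuation`, `normalize`) are hypothetical, only the cardinal, percentage and fraction patterns are handled (dates, money and telephone numbers are not), `CHINESE_PUNCS` is a short hand-picked subset rather than the full Zhon list, and replacing punctuation with a space is an assumption based on the normalized examples in the table. NSW rewriting has to run before punctuation removal, because `.`, `%` and `/` carry meaning inside numbers.

```python
import re
import string

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands

# Assumption: a small subset of Chinese punctuation, not the full Zhon list.
CHINESE_PUNCS = "，。！？、；：“”‘’（）《》【】…—"


def read_digits(s):
    """Read a digit string one digit at a time: '27149' -> '二七一四九'."""
    return "".join(DIGITS[int(c)] for c in s)


def read_cardinal(n):
    """Read an integer in 0..9999 as a Chinese cardinal: 324 -> '三百二十四'."""
    if n == 0:
        return DIGITS[0]
    s = str(n)
    parts = []
    skipped_zero = False
    for i, ch in enumerate(s):
        d = int(ch)
        if d == 0:
            skipped_zero = True
            continue
        if skipped_zero:
            parts.append(DIGITS[0])  # a single '零' for any run of inner zeros
            skipped_zero = False
        parts.append(DIGITS[d] + UNITS[len(s) - 1 - i])
    text = "".join(parts)
    # 10..19 are read '十', '十一', ... rather than '一十', '一十一', ...
    return text[1:] if text.startswith("一十") else text


def read_number(token):
    """Read '324.75'-style tokens; long or zero-led integers fall back to digits."""
    integer, _, fraction = token.partition(".")
    if len(integer) > 4 or (len(integer) > 1 and integer[0] == "0"):
        text = read_digits(integer)
    else:
        text = read_cardinal(int(integer))
    if fraction:
        text += "点" + read_digits(fraction)
    return text


def normalize_nsw(text):
    """Rewrite percentages, then fractions, then any remaining plain numbers."""
    number = r"\d+(?:\.\d+)?"
    text = re.sub(rf"({number})%",
                  lambda m: "百分之" + read_number(m.group(1)), text)
    text = re.sub(r"(\d+)/(\d+)",
                  lambda m: read_number(m.group(2)) + "分之" + read_number(m.group(1)),
                  text)
    return re.sub(number, lambda m: read_number(m.group(0)), text)


def remove_punctuation(text):
    """Replace Chinese and ASCII punctuation with spaces, then collapse spaces."""
    table = str.maketrans({c: " " for c in CHINESE_PUNCS + string.punctuation})
    return " ".join(text.translate(table).split())


def normalize(text, case="upper"):
    """NSW rewriting first, then punctuation removal, then case unification."""
    text = remove_punctuation(normalize_nsw(text))
    return text.upper() if case == "upper" else text.lower()
```

For example, `normalize("明天有62%的概率降雨。")` yields `明天有百分之六十二的概率降雨`. Choosing `case` to match the lexicon keeps English entries consistent with whatever convention the ASR/TTS lexicon uses.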

Supported text format

  1. plain text, one sentence per line(.txt)

    今天早饭吃了没
    没吃回家吃去吧
    ...
    

    Plain text is the default format.

  2. Kaldi's archive format(.ark)

    KALDI_KEY_UTT001    今天早饭吃了没
    KALDI_KEY_UTT002    没吃回家吃去吧
    ...
    

    TN skips the key in the first column and normalizes the transcription text that follows.

    Pass the --format ark option to switch to the Kaldi ark format.

  3. table format(.tsv)

    ID	AUDIO	TEXT
    UTT01	audio/UTT01.wav	今晚8点整中央5播出2020年总决赛
    ...
    

    Pass the --format tsv option; normalization is applied to the TEXT field only.

Note: all input text should be UTF-8 encoded.
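The way the three formats differ can be sketched as follows. This is only an illustration under stated assumptions: `normalize_lines` is a hypothetical helper rather than the project's actual entry point, `normalize` stands for any text-to-text normalizer, ark keys are assumed to be separated from the text by whitespace, and the tsv header is assumed to contain a column named `TEXT`.

```python
def normalize_lines(lines, fmt, normalize):
    """Yield normalized lines for fmt in {'txt', 'ark', 'tsv'} (hypothetical helper)."""
    text_index = None  # position of the TEXT column, learned from the tsv header
    for raw in lines:
        line = raw.rstrip("\r\n")
        if fmt == "ark":
            # First column is the utterance key; only the rest is normalized.
            key, *rest = line.split(None, 1) or [""]
            yield "{} {}".format(key, normalize(rest[0])) if rest else key
        elif fmt == "tsv":
            fields = line.split("\t")
            if text_index is None:
                # Header row: raises ValueError if there is no TEXT column.
                text_index = fields.index("TEXT")
                yield line
                continue
            fields[text_index] = normalize(fields[text_index])
            yield "\t".join(fields)
        else:
            # Plain text: the whole line is a sentence.
            yield normalize(line)
```

Reading the input with `open(path, encoding="utf-8")` matches the UTF-8 requirement above; the key and the AUDIO path are passed through untouched, so digits inside them are never rewritten.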

Run examples

Make sure you have Python 3; Python 2.x won't work correctly.

Run sh run.sh in the TN directory, then compare the raw text with the normalized text.

Make sure Thrax is installed and that its binaries can be found on your PATH.

Run sh run.sh in the ITN directory. Check the Makefile for grammar dependencies.
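Both prerequisites can be checked up front with a few lines of Python. This is a hedged sketch: `missing_tools` and `check_environment` are hypothetical helpers, and the tool names in `THRAX_TOOLS` are an assumption about which Thrax binaries the ITN Makefile calls, so adjust them after reading the Makefile.

```python
import shutil
import sys

# Assumption: the Thrax command-line tools that the ITN build relies on.
THRAX_TOOLS = ["thraxmakedep", "thraxcompiler"]


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def check_environment():
    """Collect human-readable problems; an empty list means everything was found."""
    problems = []
    if sys.version_info < (3,):
        problems.append("Python 3 is required")
    for name in missing_tools(THRAX_TOOLS):
        problems.append("{} not found on PATH".format(name))
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Note that `shutil.which` only exists in Python 3, so running the script under Python 2 fails at that call instead of reporting a clean message.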

Possible future work

Since TN is a typical "done is better than perfect" module in context of ASR, and the current state is sufficient for my purpose, I probably won't update this repo frequently.

There are indeed some things that need to be improved:
