pinyin-rb - Mandarin Chinese transcription conversion in Ruby

This repository contains a Ruby library and example conversion tool that makes use of the open-licensed Pinyin Database to convert between 13 different Mandarin Chinese transcription systems and variants.

Features

Included transcription systems

In total, 13 Mandarin Chinese transcription systems (or, less accurately, romanization systems -- since not all of them make use of the Roman alphabet) are available for conversion using this library. Each system is identified by an index number (0-12); this number is also used to specify the "to" and "from" transcription systems when converting text.

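| Index | Name | Chinese | Variant |
| --- | --- | --- | --- |
| 0 | Hanyu Pinyin | 漢語拼音 | Tone numbers |
| 1 | Hanyu Pinyin | 漢語拼音 | Tone diacritics |
| 2 | Bopomofo | 注音符號 | |
| 3 | Wade-Giles | 威妥瑪拼音 | |
| 4 | MPS II | | |
| 5 | Yale | 耶魯拼音 | |
| 6 | Tongyong | 通用拼音 | |
| 7 | Gwoyeu Romatzyh | 國語羅馬字 | |
| 8 | TOP | 拼聲拼音 | |
| 9 | Palladius | 俄文拼音 | |
| 10 | Character Exemplars | 漢字示例 | Traditional |
| 11 | Character Exemplars | 漢字示例 | Simplified |
| 12 | IPA | 國際音標 | |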
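
The index numbers can be awkward to remember in calling code. As a minimal sketch (not part of lib_pinyin.rb), the table above could be mirrored in a small lookup; the constant `TRANSCRIPTION_SYSTEMS` and the helper `system_name` are hypothetical names introduced here for illustration only:

```ruby
# Hypothetical lookup mirroring the table above; not provided by lib_pinyin.rb.
# The array position of each name is its index number (0-12).
TRANSCRIPTION_SYSTEMS = [
  "Hanyu Pinyin (tone numbers)",
  "Hanyu Pinyin (tone diacritics)",
  "Bopomofo",
  "Wade-Giles",
  "MPS II",
  "Yale",
  "Tongyong",
  "Gwoyeu Romatzyh",
  "TOP",
  "Palladius",
  "Character Exemplars (Traditional)",
  "Character Exemplars (Simplified)",
  "IPA"
].freeze

# Returns the system name for an index, or raises for anything outside 0-12.
def system_name(index)
  unless index.is_a?(Integer) && index.between?(0, TRANSCRIPTION_SYSTEMS.size - 1)
    raise ArgumentError, "unknown transcription system index: #{index.inspect}"
  end
  TRANSCRIPTION_SYSTEMS[index]
end

puts system_name(2)
# => Bopomofo
```

Validating the index up front like this gives a clear error message before any conversion is attempted.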

Note: The Hanyu Pinyin variant with tone diacritics uses a middle dot (·) by default to indicate the fifth (neutral) tone. However, this library includes an optional method to print the Pinyin transcription without this dot (see below for details).
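
If you only have the converted string at hand, the dot can also be removed after the fact. The following is a rough sketch that assumes the marker is the Unicode middle dot (U+00B7); `strip_neutral_dot` is a hypothetical helper, not the library's own method, and the sample string is purely illustrative:

```ruby
# Assumption: the neutral-tone marker is the middle dot, U+00B7.
NEUTRAL_TONE_DOT = "\u00B7"

# Hypothetical post-processing helper; removes every middle dot from the text.
def strip_neutral_dot(text)
  text.delete(NEUTRAL_TONE_DOT)
end

puts strip_neutral_dot("hǎo \u00B7de")
# => hǎo de
```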

Requirements

This library makes use of the latest version of the Pinyin Database, and expects a file called pinyinbiao containing the conversion data to be located in a pinyin folder in the project root directory. There are a number of ways to do this:

There are no special requirements other than a working version of Ruby.
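
Before loading the library, it can be useful to confirm that the data file is where it is expected. This is a small sketch based only on the layout described above (a pinyinbiao file inside a pinyin folder in the project root); the helper names `pinyin_data_path` and `pinyin_data_available?` are hypothetical:

```ruby
# Hypothetical helpers; the path layout follows the description above.
def pinyin_data_path(root = Dir.pwd)
  File.join(root, "pinyin", "pinyinbiao")
end

# True only if the conversion data exists as a regular file.
def pinyin_data_available?(root = Dir.pwd)
  File.file?(pinyin_data_path(root))
end

unless pinyin_data_available?
  warn "Conversion data not found at #{pinyin_data_path}"
end
```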

Usage

This project can be used either as a library (lib_pinyin.rb) or as a command-line script (convert_pinyin.rb). Details for both types of usage can be found below.

lib_pinyin

To use the library, make sure to require the library file, e.g.:

require_relative 'lib_pinyin.rb'

Before you can convert text, you need to initialize a Converter object:

conv = Py_Converter.new

By default, this initializes a conversion dictionary that works from Hanyu Pinyin to any other transcription system.

To use a different source transcription system, just specify the corresponding index number as an argument when initializing the Converter object, e.g.:

conv = Py_Converter.new(2)
# converts from Bopomofo to any other system

You can then convert any string of text using the convert_line method, which takes a string and an integer representing the target transcription system as arguments:

pinyin = "Bopomofo to Hanyu Pinyin conversion: ㄏㄢˋ ㄩˇ ㄆㄧㄣ ㄧㄣ ㄓㄨㄢˇ ㄏㄨㄢˋ"
puts conv.convert_line(pinyin, 1)
# => Bopomofo to Hanyu Pinyin conversion: hàn yǔ pīn yīn zhuǎn huàn

Tip: If you provide 13 as the index number when converting, the string will be converted into all of the available systems in sequence, e.g.:

pinyin = "han4 yu3 pin1 yin1 fang1 an4 yi1 lan3"
puts conv.convert_line(pinyin, 13)
# => han4 yu3 pin1 yin1 fang1 an4 yi1 lan3 
# => hàn yǔ pīn yīn fāng àn yī lǎn 
# => ㄏㄢˋ ㄩˇ ㄆㄧㄣ ㄧㄣ ㄈㄤ ㄢˋ ㄧ ㄌㄢˇ 
# => han⁴ yü³ p'in¹ yin¹ fang¹ an⁴ i¹ lan³ 
# => han4 yu3 pin1 yin1 fang1 an4 yi1 lan3 
# => hàn yǔ pīn yīn fāng àn yī lǎn 
# => hanˋ yuˇ pin yin fang anˋ yi lanˇ 
# => hann yeu pin in fang ann i laan 
# => Han yu PIN YIN FANG An YI lan 
# => хань⁴ юй³ пинь¹ инь¹ фан¹ ань⁴ и¹ лань³ 
# => 汗⁴ 于³ 品¹ 因¹ 方¹ 安⁴ 一¹ 懶³ 
# => 汗⁴ 于³ 品¹ 因¹ 方¹ 安⁴ 一¹ 懒³ 
# => xan˥˩ y˨˩˦ pʰɪn˥˥ ɪn˥˥ fɑŋ˥˥ an˥˩ i˥˥ lan˨˩˦

The Converter class has a built-in method for checking if a given string is a valid syllable in any of the available Mandarin Chinese transcription systems:

conv = Py_Converter.new
# checks against syllables in Hanyu Pinyin (numerals) by default

word = "xiang1"
puts conv.check_syllable(word)
# => true

word = "xiangg1"
puts conv.check_syllable(word)
# => false

To check syllables in any other transcription system, just specify it when initializing the Converter class:

conv = Py_Converter.new(2)
# checks valid Bopomofo syllables

word = "ㄕㄨㄤㄤ"
puts conv.check_syllable(word)
# => false

word = "ㄕㄨㄤ"
puts conv.check_syllable(word)
# => true

Converting syllables

You can convert individual syllables using the convert_syllable method of the Converter class. This method requires two arguments: a string consisting of a single syllable in the source transcription system and an integer representing the index number of the target transcription system.

For example, to convert a syllable in Hanyu Pinyin into IPA:

conv = Py_Converter.new
p conv.convert_syllable("shuang1", 12)
# => "ʂwɑŋ˥˥"

To convert from a different source transcription system, just provide the corresponding index number when initializing the Converter object.

For example, to convert IPA into Bopomofo:

conv = Py_Converter.new(12)
p conv.convert_syllable("ʂwɑŋ˥˥", 2)
# => "ㄕㄨㄤ"

If 13 is passed as the final argument to the convert_syllable method, it will return an array containing all of the possible transcriptions of the given syllable:

conv = Py_Converter.new
p conv.convert_syllable("shuang1", 13)
# => ["shuang1 ", "shuāng ", "ㄕㄨㄤ ", "shuang¹ ", "shuang1 ", "shwāng ", "shuang ", "shuang ", "SHUANG ", "шуан¹ ", "雙¹ ", "双¹ ", "ʂwɑŋ˥˥"]
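
The returned array carries no labels, so it may help to pair each entry with its system name. The sketch below assumes the array is ordered by index number (0-12), which matches the sample output above, and works on that sample as literal data rather than calling the library. `SYSTEM_NAMES` and `label_transcriptions` are hypothetical names:

```ruby
# Hypothetical labels, in index order (0-12).
SYSTEM_NAMES = [
  "Hanyu Pinyin (tone numbers)", "Hanyu Pinyin (tone diacritics)", "Bopomofo",
  "Wade-Giles", "MPS II", "Yale", "Tongyong", "Gwoyeu Romatzyh", "TOP",
  "Palladius", "Character Exemplars (Traditional)",
  "Character Exemplars (Simplified)", "IPA"
].freeze

# Pairs each transcription with its system name and trims trailing spaces.
# Assumes the array is ordered by system index.
def label_transcriptions(transcriptions)
  unless transcriptions.size == SYSTEM_NAMES.size
    raise ArgumentError, "expected #{SYSTEM_NAMES.size} transcriptions"
  end
  SYSTEM_NAMES.zip(transcriptions.map(&:strip)).to_h
end

# Sample data copied from the output shown above.
all = ["shuang1 ", "shuāng ", "ㄕㄨㄤ ", "shuang¹ ", "shuang1 ", "shwāng ",
       "shuang ", "shuang ", "SHUANG ", "шуан¹ ", "雙¹ ", "双¹ ", "ʂwɑŋ˥˥"]

labeled = label_transcriptions(all)
puts labeled["Yale"]
# => shwāng
```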

convert_pinyin

The convert_pinyin.rb file found in the root directory is a simple script that demonstrates the use of the lib_pinyin library. It allows for quick and easy conversion between arbitrary Mandarin Chinese transcription systems on the command-line.

Basic usage

./convert_pinyin.rb -i "This is a test: Han4 yu3 pin1 yin1 fang1 an4 yi1 lan3"
# => This is a test: hàn yǔ pīn yīn fāng àn yī lǎn

The above example converts the Mandarin Chinese romanization in the provided sentence from Hanyu Pinyin (with numerals) into Hanyu Pinyin with diacritics. All of the text that is not recognizable as Mandarin Chinese romanization (e.g., all of the English text before the colon in the provided sentence) is ignored.
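
To illustrate the idea of leaving unrecognized words untouched, here is a toy sketch. It is not how lib_pinyin.rb is implemented: it uses a tiny hand-written lookup table (`TOY_TABLE`) instead of the Pinyin Database, and `convert_known_words` is a hypothetical helper. Lowercasing before lookup is an assumption made so that "Han4" is handled like "han4", as in the example above:

```ruby
# Toy lookup table (tone numbers => tone diacritics); stands in for real data.
TOY_TABLE = {
  "han4" => "hàn", "yu3" => "yǔ", "pin1" => "pīn", "yin1" => "yīn"
}.freeze

# Converts words found in the table and passes every other word through as-is.
def convert_known_words(line, table)
  line.split(" ").map { |word| table.fetch(word.downcase, word) }.join(" ")
end

puts convert_known_words("This is a test: Han4 yu3 pin1 yin1", TOY_TABLE)
# => This is a test: hàn yǔ pīn yīn
```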

To convert the text into Bopomofo instead, just provide the index number for Bopomofo (i.e., 2 -- see list above) using the -t (--target) option:

./convert_pinyin.rb -i "This is a test: Han4 yu3 pin1 yin1 fang1 an4 yi1 lan3" -t 2
# => This is a test: ㄏㄢˋ ㄩˇ ㄆㄧㄣ ㄧㄣ ㄈㄤ ㄢˋ ㄧ ㄌㄢˇ

As can be seen, the text has now been converted into Bopomofo orthography. Conversion into other systems is equally easy -- just replace 2 above with the index number of the system you wish to use for output.

To convert from a different source transcription system (e.g., to convert from Wade-Giles to Yale, or from Yale to Hanyu Pinyin), provide the source system index number as a parameter using the -s (--source) option. The example below converts from Bopomofo to Hanyu Pinyin with diacritics:

./convert_pinyin.rb -i "This is a test: ㄏㄢˋ ㄩˇ ㄆㄧㄣ ㄧㄣ ㄈㄤ ㄢˋ ㄧ ㄌㄢˇ" -s 2
# => This is a test: hàn yǔ pīn yīn fāng àn yī lǎn
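
For readers writing their own wrapper script, the flags used above could be parsed with Ruby's standard optparse library roughly as follows. This is a sketch, not the code in convert_pinyin.rb. The short flags, `--source`, `--target`, and `--check` (covered under "Checking input validity") come from this README; the long name `--input` and the function name `parse_options` are assumptions. The defaults (source 0, target 1) reflect the behaviour of the basic usage example:

```ruby
require "optparse"

# Parses the command-line flags into a hash. Sketch only; --input is an
# assumed long name, and the defaults mirror the basic usage example.
def parse_options(argv)
  options = { input: nil, source: 0, target: 1, check: false }
  OptionParser.new do |opts|
    opts.on("-i", "--input TEXT", "Text to convert") { |v| options[:input] = v }
    opts.on("-s", "--source INDEX", Integer, "Source system index") { |v| options[:source] = v }
    opts.on("-t", "--target INDEX", Integer, "Target system index") { |v| options[:target] = v }
    opts.on("-c", "--check", "List invalid syllables") { options[:check] = true }
  end.parse!(argv.dup) # work on a copy so the caller's array is left intact
  options
end

options = parse_options(["-i", "ㄏㄢˋ ㄩˇ", "-s", "2"])
puts "source=#{options[:source]} target=#{options[:target]}"
# => source=2 target=1
```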

Checking input validity

Invalid syllables can be identified using the -c (--check) option. This checks each word in the input string and outputs a list of words that are not recognizable as valid Mandarin Chinese syllables in the given transcription system:

./convert_pinyin.rb -i "This is a test: Han4 yu3 pin1 yin1 fang1 an4 yi1 lan3" -c
# => This
# => is
# => a
# => test:

The output in the above example contains words that are not valid syllables in Hanyu Pinyin romanization (the default, since no other system was specified). To use a different transcription system, provide the appropriate index number using the -s option. For example, the command below checks for invalid syllables in Wade-Giles:

./convert_pinyin.rb -i "This is a test: han⁴ yü³ p'in¹ yin¹ fang¹ an⁴ i¹ laan³" -c -s 3
# => This
# => is
# => a
# => test:
# => laan³

In the example above, the output contains (apart from English) the syllable laan³, because it is not a valid syllable in the Wade-Giles system.
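
The same kind of check can be sketched in library terms: split the input on whitespace and keep every word that a checker rejects. The example below is self-contained, so it uses a hypothetical `StubChecker` with a hard-coded word list in place of a real Py_Converter. Any object that responds to check_syllable could be passed instead. `invalid_words` is likewise a hypothetical helper and not necessarily how the -c option is implemented:

```ruby
require "set"

# Hypothetical stand-in for a converter; knows only a few Wade-Giles syllables.
class StubChecker
  VALID = Set["han⁴", "yü³", "p'in¹", "yin¹", "fang¹", "an⁴", "i¹", "lan³"].freeze

  def check_syllable(word)
    VALID.include?(word)
  end
end

# Returns the words that the checker does not recognize as valid syllables.
def invalid_words(line, checker)
  line.split(" ").reject { |word| checker.check_syllable(word) }
end

puts invalid_words("This is a test: han⁴ yü³ p'in¹ yin¹ fang¹ an⁴ i¹ laan³", StubChecker.new)
# => This
# => is
# => a
# => test:
# => laan³
```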

Modifying the output

The output transcription can be further modified using optional command-line flags, for example to convert regular tone numerals to superscript numerals (Unicode), or to revert to the dotless Hanyu Pinyin transcription.
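
As an illustration of the superscript conversion, the following sketch replaces a tone numeral (1-5) with its Unicode superscript only when it directly follows a letter and ends the word. It is an independent example, not the script's own implementation; `SUPERSCRIPT_DIGITS` and `superscript_tones` are hypothetical names:

```ruby
# Unicode superscript forms of the tone numerals 1-5.
SUPERSCRIPT_DIGITS = {
  "1" => "¹", "2" => "²", "3" => "³", "4" => "⁴", "5" => "⁵"
}.freeze

# Replaces a digit 1-5 that follows a letter and is not followed by another
# letter or digit, so ordinary numbers such as "101" are left alone.
def superscript_tones(text)
  text.gsub(/(?<=[[:alpha:]])[1-5](?![[:alnum:]])/) { |digit| SUPERSCRIPT_DIGITS[digit] }
end

puts superscript_tones("han4 yu3 pin1 yin1")
# => han⁴ yu³ pin¹ yin¹
```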

Options

The following options can be provided to convert_pinyin.rb to control the conversion process:

To do

See also

License