pingyam-rb - Cantonese romanization conversion in Ruby

This repository contains a Ruby library and example conversion tool that makes use of the open-licensed Pingyam Database to convert between 11 different Cantonese romanization systems and variants.

Features

Included romanization systems

In total, 11 Cantonese romanization systems are available for conversion using this library. Each variant is identified by an index number (0-10), which is also used to specify the source ("from") and target ("to") romanization when converting text.

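| Index | Name | Chinese | Variant |
|---|---|---|---|
| 0 | Yale | 耶魯拼音 | Tone numbers |
| 1 | Yale | | Tone diacritics |
| 2 | Cantonese Pinyin | 教院拼音 | |
| 3 | S.L. Wong | 黃錫凌 | Tone numbers |
| 4 | S.L. Wong | | Tone diacritics |
| 5 | International Phonetic Alphabet | 國際音標 | |
| 6 | Jyutping | 粵拼 | |
| 7 | Canton | 廣州拼音 | |
| 8 | Sidney Lau | 劉錫祥 | |
| 9 | Penkyamp | 粵語拼音字 | Tone numbers |
| 10 | Penkyamp | | Tone diacritics |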
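
If you refer to these index numbers in your own scripts, it can help to keep them in a small lookup table. The sketch below is illustrative only: the constant `ROMANIZATION_SYSTEMS` and the helper `system_label` are hypothetical names and are not part of lib_pingyam.

```ruby
# Hypothetical lookup table mirroring the index list above.
# These names are NOT part of lib_pingyam; they only illustrate the numbering.
ROMANIZATION_SYSTEMS = {
  0  => "Yale (tone numbers)",
  1  => "Yale (tone diacritics)",
  2  => "Cantonese Pinyin",
  3  => "S.L. Wong (tone numbers)",
  4  => "S.L. Wong (tone diacritics)",
  5  => "International Phonetic Alphabet",
  6  => "Jyutping",
  7  => "Canton",
  8  => "Sidney Lau",
  9  => "Penkyamp (tone numbers)",
  10 => "Penkyamp (tone diacritics)"
}.freeze

# Returns a readable label for an index, or raises for anything outside 0-10.
def system_label(index)
  ROMANIZATION_SYSTEMS.fetch(index) do
    raise ArgumentError, "unknown romanization index: #{index.inspect}"
  end
end

puts system_label(6)
```

Failing loudly on an unknown index makes typos in the source or target number easier to spot.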

Note: A modified 9-tone Yale system is used by default. However, this library includes a method to convert the Yale transcription to the more traditional 6-tone system (see below for details).

Requirements

This library uses the latest version of the Pingyam database and expects a file called pingyambiu, containing the conversion data, to be located in a pingyam folder in the project root directory. There are a number of ways to do this.
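
Whichever way the data file is obtained, you can confirm that it sits where the library expects it before converting anything. This is a minimal sketch that assumes only the layout described above (a pingyam folder containing pingyambiu); the helper names are hypothetical and not part of the library.

```ruby
# Builds the expected location of the Pingyam data file under a project root.
def pingyam_data_path(project_root)
  File.join(project_root, "pingyam", "pingyambiu")
end

# True when a regular file exists at the expected location.
def pingyam_data_present?(project_root)
  File.file?(pingyam_data_path(project_root))
end

root = Dir.pwd # assumption: the script is run from the project root
unless pingyam_data_present?(root)
  warn "Pingyam data not found at #{pingyam_data_path(root)}"
end
```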

There are no other requirements besides a working Ruby installation.

Usage

This project can be used either as a library (lib_pingyam.rb) or as a command-line script (convert_pingyam.rb). Details for both types of usage can be found below.

lib_pingyam

To use the library, make sure to require the library file, e.g.:

require_relative 'lib_pingyam.rb'

Before you can convert text, you need to initialize a Converter object:

conv = Converter.new

By default, this initializes a conversion dictionary that converts from Yale with tone numbers (index 0) to any other romanization system.

To use a different source romanization system, just specify the corresponding index number as an argument when initializing the Converter object, e.g.:

conv = Converter.new(6)
# => This converts from Jyutping to any other system

You can then convert any string of text using the convert_line method, which takes a string and an integer representing the target romanization system as arguments:

pingyam = "Yale to Jyutping conversion: yut9 yu5 jyun2 wun6"
puts conv.convert_line(pingyam, 6)
# => Yale to Jyutping conversion: jyut6 jyu5 zyun2 wun6

Tip: If you provide 11 as the index number when converting, the string will be translated into all of the available systems sequentially, e.g.:

pingyam = "yut9 yu5 ping3 yam1 fong1 on3 yat7 laam4"
puts conv.convert_line(pingyam, 11)
# => yut9 yu5 ping3 yam1 fong1 on3 yat7 laam4 
# => yuht yúh ping yām fōng on yāt làahm 
# => jyt9 jy5 ping3 jam1 fong1 on3 jat7 laam4 
# => jyt⁹ jy⁵ pɪŋ³ jɐm¹ fɔŋ¹ ɔn³ jɐt⁷ lam⁴ 
# => _jyt ˏjy ¯pɪŋ 'jɐm 'fɔŋ ¯ɔn 'jɐt ˌlam 
# => jyːt˨ jyː˩˧ pʰɪŋ˧ jɐm˥ fɔːŋ˥ ɔːn˧ jɐt˥ laːm˨˩ 
# => jyut6 jyu5 ping3 jam1 fong1 on3 jat1 laam4 
# => yud6 yu5 ping3 yem1 fong1 on3 yed1 lam4 
# => yuet⁶ yue⁵ ping³ yam¹ fong¹ on³ yat¹ laam⁴ 
# => yeud6 yeu5 penk3 yamp1 fong1 on3 yat1 lam4 
# => yeùd yeú pênk yämp föng ôn yät lam

The Converter class has a built-in method for checking if a given string is a valid syllable in any of the available Cantonese romanization systems:

conv = Converter.new
# checks against syllables in Yale (numerals) by default

word = "heung1"
puts conv.check_syllable(word)
# => true

word = "heungg1"
puts conv.check_syllable(word)
# => false

To check syllables in any other romanization system, just specify it when initializing the Converter class:

conv = Converter.new(6)
# checks valid Jyutping syllables

word = "heung1"
puts conv.check_syllable(word)
# => false

word = "hoeng1"
puts conv.check_syllable(word)
# => true

Converting syllables

You can convert individual syllables using the convert_syllable method of the Converter class. This method requires two arguments: a string consisting of a single romanized syllable and an integer representing the index number of the target romanization system.

For example, to convert a syllable in Yale into IPA:

conv = Converter.new
p conv.convert_syllable("heung1", 5)
# => "hœːŋ˥"

To convert from a different source transcription system, just provide the corresponding index number when initializing the Converter object.

For example, to convert Jyutping into IPA:

conv = Converter.new(6)
p conv.convert_syllable("hoeng1", 5)
# => "hœːŋ˥"

If 11 is passed as the final argument to the convert_syllable method, it will return an array containing all of the possible transcriptions of the given syllable:

conv = Converter.new
p conv.convert_syllable("heung1", 11)
# => ["heung1", "heūng", "hoeng1", "hœŋ¹", "'hœŋ", "hœːŋ˥", "hoeng1", "hêng1", "heung¹", "heong1", "heöng"]
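
One way to picture this behaviour is as a row lookup: each syllable is a row with one column per romanization system, and the source and target indices pick columns. The sketch below is a toy model built from the example output above. The row layout and the names `ROWS` and `lookup_syllable` are assumptions for illustration, not the library's actual implementation.

```ruby
# One toy "row" holding the same syllable in all 11 systems (indices 0-10),
# copied from the example output above. The real data layout is an assumption.
ROWS = [
  ["heung1", "heūng", "hoeng1", "hœŋ¹", "'hœŋ", "hœːŋ˥",
   "hoeng1", "hêng1", "heung¹", "heong1", "heöng"]
].freeze

ALL_SYSTEMS = 11 # index that requests every transcription at once

# Looks up `syllable` in the `source` column and returns the `target` column,
# the whole row when target is 11, or nil when the syllable is unknown.
def lookup_syllable(syllable, source, target, rows = ROWS)
  row = rows.find { |r| r[source] == syllable }
  return nil if row.nil?

  target == ALL_SYSTEMS ? row.dup : row[target]
end

p lookup_syllable("heung1", 0, 5)   # Yale (numerals) to IPA
p lookup_syllable("hoeng1", 6, 5)   # Jyutping to IPA
p lookup_syllable("heung1", 0, 11)  # every transcription of the syllable
```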

convert_pingyam

The convert_pingyam.rb file found in the root directory is a simple script that demonstrates the use of the lib_pingyam library. It allows for quick and easy conversion between arbitrary Cantonese romanization systems on the command-line.

Basic usage

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6"
# => This is a test: yuht yúh ping yām jyún wuhn

The above example converts the Cantonese romanization in the provided sentence from Yale (with numerals) into Yale with diacritics. Any text that is not recognizable as Cantonese romanization (e.g., the English text before the colon in the provided sentence) is left unchanged.
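
The pass-through behaviour for unrecognized text can be sketched as a token-by-token replacement. The example below uses a toy table containing only the syllables from the sentence above. The names `YALE_TO_DIACRITICS` and `convert_tokens` are hypothetical, and the real script may tokenize its input differently.

```ruby
# Toy Yale (numerals) -> Yale (diacritics) table, taken from the example above.
YALE_TO_DIACRITICS = {
  "yut9" => "yuht", "yu5" => "yúh", "ping3" => "ping",
  "yam1" => "yām", "jyun2" => "jyún", "wun6" => "wuhn"
}.freeze

# Replaces every whitespace-delimited token found in the table (case-insensitively)
# and leaves all other tokens, such as English words, exactly as they were.
def convert_tokens(line, table)
  line.gsub(/\S+/) { |token| table.fetch(token.downcase, token) }
end

puts convert_tokens("This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6", YALE_TO_DIACRITICS)
```

Because unknown tokens fall back to themselves, mixed English and romanized text can be passed in without any preprocessing.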

To convert the text into Jyutping instead, just provide the index number for Jyutping (i.e., 6 -- see list above) using the -t (--target) option:

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -t 6
# => This is a test: jyut6 jyu5 ping3 jam1 zyun2 wun6

As can be seen, the text has now been converted into Jyutping romanization. Conversion into other systems is equally easy -- just replace 6 above with the index number of the system you wish to use for output.

To convert from a different source romanization system (e.g., to convert from Jyutping to Yale, or from S.L. Wong to Jyutping), provide the source system index number as a parameter using the -s (--source) option. The example below converts from Jyutping to Yale with diacritics:

./convert_pingyam.rb -i "This is a test: jyut6 jyu5 ping3 jam1 zyun2 wun6" -s 6 -t 1
# => This is a test: yuht yúh ping yām jyún wuhn
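
For readers who want to build a similar command-line wrapper, the options shown so far can be parsed with Ruby's standard OptionParser. This is a sketch under stated assumptions, not the code of convert_pingyam.rb: the function `parse_cli` is hypothetical, only flags mentioned in this document are used, and the default target of 1 is inferred from the basic usage example above.

```ruby
require "optparse"

# Parses the documented flags into a plain hash.
# Defaults (source 0, target 1) are inferred from the examples, not confirmed.
def parse_cli(argv)
  options = { input: nil, source: 0, target: 1, check: false }
  OptionParser.new do |opts|
    opts.on("-i TEXT", String) { |value| options[:input] = value }
    opts.on("-s", "--source INDEX", Integer) { |value| options[:source] = value }
    opts.on("-t", "--target INDEX", Integer) { |value| options[:target] = value }
    opts.on("-c", "--check") { options[:check] = true }
  end.parse(argv)
  options
end

p parse_cli(["-i", "jyut6 jyu5", "-s", "6", "-t", "1"])
```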

Checking input validity

Invalid romanization syllables can be identified using the -c (--check) option. This checks each word in the input string and outputs a list of words that are not recognizable as valid Cantonese syllables in the given romanization system:

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -c
# => This
# => is
# => a
# => test:

The output in the above example contains the words that are not valid syllables in Yale romanization (the default, since no other system was specified). To use a different romanization system, provide the appropriate index number using the -s option. For example, the command below checks for invalid syllables in Jyutping:

./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -c -s 6
# => This
# => is
# => a
# => test:
# => Yut9
# => yu5
# => yam1

In the example above, the output contains, in addition to the English words, the Yale syllables Yut9, yu5, and yam1, because these are not valid syllables in Jyutping.
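
The check itself amounts to filtering the input words against a set of known syllables. The sketch below reproduces the Jyutping result above with a toy syllable set. In practice the list would come from the Pingyam data, and the names `JYUTPING_SYLLABLES` and `invalid_words` are hypothetical.

```ruby
require "set"

# Toy set of valid Jyutping syllables; the real list comes from the Pingyam data.
JYUTPING_SYLLABLES = Set["jyut6", "jyu5", "ping3", "jam1", "jyun2", "zyun2", "wun6"].freeze

# Returns the whitespace-separated words that are not in the given syllable set.
# Words are compared in lowercase but reported as originally written.
def invalid_words(text, valid_syllables)
  text.split.reject { |word| valid_syllables.include?(word.downcase) }
end

puts invalid_words("This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6", JYUTPING_SYLLABLES)
```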

Modifying the output

The output transcription can be further modified using optional command-line flags, for example to convert regular tone numerals to superscript numerals (Unicode), or to revert to the traditional 6-tone Yale system.

These modifications can be combined -- the example below both normalizes the Yale transcription and converts the numerals to superscript:

./convert_pingyam.rb -i "yat7 jek8 kek9" -t 0 -YS
# => yat¹ jek³ kek⁶
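
Both modifications are simple text substitutions on the tone digits. The sketch below shows one way they could be written. It assumes the mapping visible in the example above (entering tones 7, 8, 9 become 1, 3, 6), and the helper names are hypothetical rather than the library's own methods.

```ruby
# Entering tones 7, 8, 9 of the 9-tone scheme map onto tones 1, 3, 6,
# as in the example above (yat7 jek8 kek9 -> yat1 jek3 kek6).
ENTERING_TONES = { "7" => "1", "8" => "3", "9" => "6" }.freeze

SUPERSCRIPT_DIGITS = {
  "1" => "¹", "2" => "²", "3" => "³", "4" => "⁴", "5" => "⁵",
  "6" => "⁶", "7" => "⁷", "8" => "⁸", "9" => "⁹"
}.freeze

# Matches a single tone digit that directly follows a letter and ends a word.
TONE_DIGIT = /(?<=[[:alpha:]])[1-9]\b/

# Rewrites 9-tone numerals to the 6-tone scheme; tones 1-6 are left as they are.
def to_six_tone(text)
  text.gsub(TONE_DIGIT) { |digit| ENTERING_TONES.fetch(digit, digit) }
end

# Replaces tone numerals with their Unicode superscript equivalents.
def to_superscript(text)
  text.gsub(TONE_DIGIT, SUPERSCRIPT_DIGITS)
end

puts to_superscript(to_six_tone("yat7 jek8 kek9"))
```

Restricting the pattern to digits that follow a letter keeps ordinary numbers elsewhere in the text untouched.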

Options

The following options can be provided to convert_pingyam.rb to control the conversion process:

To do

See also

License