pingyam-rb - Cantonese romanization conversion in Ruby
This repository contains a Ruby library and example conversion tool that makes use of the open-licensed Pingyam Database to convert between 11 different Cantonese romanization systems and variants.
Features
- Converts to and from any Cantonese romanization scheme (including IPA)
- Can convert single and multiple words / whole lines of romanized text
- Handles mixed input (non-Cantonese text is ignored)
- Converter script ready to use on the command-line -- or include the library in your own code
Included romanization systems
In total, 11 Cantonese romanization systems are available for conversion using this library. Each variant is identified by a number (0-10); this number is also used for identifying the "to" and "from" romanizations to use while converting text.
Index | Name | Chinese | Variant |
---|---|---|---|
0 | Yale | 耶魯拼音 | Tone numbers |
1 | Yale | 耶魯拼音 | Tone diacritics |
2 | Cantonese Pinyin | 教院拼音 | |
3 | S.L. Wong | 黃錫凌 | Tone numbers |
4 | S.L. Wong | 黃錫凌 | Tone diacritics |
5 | International Phonetic Alphabet | 國際音標 | |
6 | Jyutping | 粵拼 | |
7 | Canton | 廣州拼音 | |
8 | Sidney Lau | 劉錫祥 | |
9 | Penkyamp | 粵語拼音字 | Tone numbers |
10 | Penkyamp | 粵語拼音字 | Tone diacritics |
Note: A modified 9-tone Yale system is used by default. However, this library includes a method to convert the Yale transcription to the more traditional 6-tone system (see below for details).
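Because the systems are referred to by bare index numbers, calling code can be easier to read with named constants. The module below is a small sketch along those lines; `PingyamIndex` and its constant names are hypothetical and are not defined by the library itself. The values simply mirror the table above, plus the special value 11 described later in this document.

```ruby
# Hypothetical constants mirroring the index table; not part of lib_pingyam.
module PingyamIndex
  YALE_NUMBERS        = 0
  YALE_DIACRITICS     = 1
  CANTONESE_PINYIN    = 2
  SL_WONG_NUMBERS     = 3
  SL_WONG_DIACRITICS  = 4
  IPA                 = 5
  JYUTPING            = 6
  CANTON              = 7
  SIDNEY_LAU          = 8
  PENKYAMP_NUMBERS    = 9
  PENKYAMP_DIACRITICS = 10
  ALL                 = 11 # special value: output in every system
end
```

With such constants in place, a call could pass `PingyamIndex::JYUTPING` instead of a literal `6`.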
Requirements
This library makes use of the latest version of the Pingyam database, and expects a file called pingyambiu containing the conversion data to be located in a pingyam folder in the project root directory. There are a number of ways to do this:
- Easiest method: Run the update_database.rb script to get the latest version of the database
  - Instructions: In the project root directory, enter the following command: ./update_database.rb
  - If the current version of the database is different from the one on your machine, your local copy will be updated
- Download the file directly from the Pingyam project here.
  - Make sure to create a directory called pingyam in the project root and copy the file to that directory
- If you have git installed, you can clone the database into the root project folder using the following command: `git clone https://github.com/kfcd/pingyam.git`
- Download the Pingyam project into a separate location and create a symlink in the current project directory
There are no special requirements other than a working version of Ruby.
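To confirm that the data file ended up where the library expects it, a short check like the one below may help. It is only a sketch: `database_present?` is a hypothetical helper, not something the library provides, and it assumes the `pingyam/pingyambiu` layout described above.

```ruby
# Sketch: check that the Pingyam data file is where the library expects it.
# `database_present?` is a hypothetical helper, not part of lib_pingyam.
def database_present?(project_root = Dir.pwd)
  # File.file? follows symlinks, so a symlinked pingyam folder also passes.
  File.file?(File.join(project_root, "pingyam", "pingyambiu"))
end

if database_present?
  puts "Pingyam database found."
else
  warn "pingyam/pingyambiu is missing; run ./update_database.rb first."
end
```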
Usage
This project can be used either as a library (lib_pingyam.rb) or as a command-line script (convert_pingyam.rb). Details for both types of usage can be found below.
lib_pingyam
To use the library, make sure to require the library file, e.g.:
require_relative 'lib_pingyam.rb'
Before you can convert text, you need to initialize a Converter object:
conv = Converter.new
By default, this initializes a conversion dictionary that works from Yale to any other romanization system.
To use a different source romanization system, just specify the corresponding index number as an argument when initializing the Converter object, e.g.:
conv = Converter.new(6)
# => This converts from Jyutping to any other system
You can then convert any string of text using the convert_line method, which takes a string and an integer representing the target romanization system as arguments:
pingyam = "Yale to Jyutping conversion: yut9 yu5 jyun2 wun6"
puts conv.convert_line(pingyam, 6)
# => Yale to Jyutping conversion: jyut6 jyu5 zyun2 wun6
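The pass-through behaviour for mixed input can be pictured with a toy model. The sketch below is not the library's implementation: `convert_line_sketch` and the four-entry table are hypothetical, with the table entries copied from the example above. It also assumes that matching is case-insensitive.

```ruby
# Toy Yale -> Jyutping table, taken from the example above.
# The real library reads its data from the Pingyam database instead.
YALE_TO_JYUTPING = {
  "yut9"  => "jyut6",
  "yu5"   => "jyu5",
  "jyun2" => "zyun2",
  "wun6"  => "wun6"
}.freeze

# Hypothetical helper: convert known syllables, pass everything else through.
def convert_line_sketch(line, table)
  # Splitting on a captured group keeps the whitespace tokens, so the
  # original spacing survives when the pieces are joined again.
  line.split(/(\s+)/, -1).map { |token| table.fetch(token.downcase, token) }.join
end

puts convert_line_sketch("Yale to Jyutping conversion: yut9 yu5 jyun2 wun6", YALE_TO_JYUTPING)
# => Yale to Jyutping conversion: jyut6 jyu5 zyun2 wun6
```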
Tip: If you provide 11 as the index number when converting, the string will be translated into all of the available systems sequentially, e.g.:
pingyam = "yut9 yu5 ping3 yam1 fong1 on3 yat7 laam4"
puts conv.convert_line(pingyam, 11)
# => yut9 yu5 ping3 yam1 fong1 on3 yat7 laam4
# => yuht yúh ping yām fōng on yāt làahm
# => jyt9 jy5 ping3 jam1 fong1 on3 jat7 laam4
# => jyt⁹ jy⁵ pɪŋ³ jɐm¹ fɔŋ¹ ɔn³ jɐt⁷ lam⁴
# => _jyt ˏjy ¯pɪŋ 'jɐm 'fɔŋ ¯ɔn 'jɐt ˌlam
# => jyːt˨ jyː˩˧ pʰɪŋ˧ jɐm˥ fɔːŋ˥ ɔːn˧ jɐt˥ laːm˨˩
# => jyut6 jyu5 ping3 jam1 fong1 on3 jat1 laam4
# => yud6 yu5 ping3 yem1 fong1 on3 yed1 lam4
# => yuet⁶ yue⁵ ping³ yam¹ fong¹ on³ yat¹ laam⁴
# => yeud6 yeu5 penk3 yamp1 fong1 on3 yat1 lam4
# => yeùd yeú pênk yämp föng ôn yät lam
The Converter class has a built-in method for checking if a given string is a valid syllable in any of the available Cantonese romanization systems:
conv = Converter.new
# checks against syllables in Yale (numerals) by default
word = "heung1"
puts conv.check_syllable(word)
# => true
word = "heungg1"
puts conv.check_syllable(word)
# => false
To check syllables in any other romanization system, just specify it when initializing the Converter class:
conv = Converter.new(6)
# checks valid Jyutping syllables
word = "heung1"
puts conv.check_syllable(word)
# => false
word = "hoeng1"
puts conv.check_syllable(word)
# => true
Converting syllables
You can convert individual syllables using the convert_syllable method of the Converter class. This method requires two arguments: a string consisting of a single romanized syllable and an integer representing the index number of the target romanization system.
For example, to convert a syllable in Yale into IPA:
conv = Converter.new
p conv.convert_syllable("heung1", 5)
# => "hœːŋ˥"
To convert from a different source transcription system, just provide the corresponding index number when initializing the Converter object.
For example, to convert Jyutping into IPA:
conv = Converter.new(6)
p conv.convert_syllable("hoeng1", 5)
# => "hœːŋ˥"
If 11 is passed as the final argument to the convert_syllable method, it will return an array containing all of the possible transcriptions of the given syllable:
conv = Converter.new
p conv.convert_syllable("heung1", 11)
# => ["heung1", "heūng", "hoeng1", "hœŋ¹", "'hœŋ", "hœːŋ˥", "hoeng1", "hêng1", "heung¹", "heong1", "heöng"]
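One way to picture what happens here is a table with one row per syllable and one column per system, where the source index selects the lookup column and the target index selects the output column. The sketch below models that idea with two rows copied from the example outputs in this document. `ToyConverter` is a hypothetical stand-in, and nothing here is meant to describe how the library stores or searches its data.

```ruby
# Two rows copied from the example outputs in this README; columns follow
# the index numbers 0-10. The real data lives in the Pingyam database.
ROWS = [
  ["heung1", "heūng", "hoeng1", "hœŋ¹", "'hœŋ", "hœːŋ˥", "hoeng1", "hêng1", "heung¹", "heong1", "heöng"],
  ["yam1", "yām", "jam1", "jɐm¹", "'jɐm", "jɐm˥", "jam1", "yem1", "yam¹", "yamp1", "yämp"]
].freeze

# Hypothetical stand-in for the library's Converter, for illustration only.
class ToyConverter
  ALL_SYSTEMS = 11

  def initialize(source = 0)
    # Index the rows by their spelling in the source system.
    @rows = ROWS.each_with_object({}) { |row, index| index[row[source]] = row }
  end

  def convert_syllable(syllable, target)
    row = @rows[syllable]
    return syllable if row.nil? # assumption: unknown input comes back unchanged

    target == ALL_SYSTEMS ? row : row[target]
  end
end

puts ToyConverter.new.convert_syllable("heung1", 5)    # Yale -> IPA
puts ToyConverter.new(6).convert_syllable("hoeng1", 5) # Jyutping -> IPA
```

Both calls print the same IPA form, since they reach the same row through different source columns.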
convert_pingyam
The convert_pingyam.rb file found in the root directory is a simple script that demonstrates the use of the lib_pingyam library. It allows for quick and easy conversion between arbitrary Cantonese romanization systems on the command-line.
Basic usage
./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6"
# => This is a test: yuht yúh ping yām jyún wuhn
The above example converts the Cantonese romanization in the provided sentence from Yale (with numerals) into Yale with diacritics. All of the text that is not recognizable as Cantonese romanization (e.g., all of the English text before the colon in the provided sentence) is ignored.
To convert the text into Jyutping instead, just provide the index number for Jyutping (i.e., 6 -- see list above) using the -t (--target) option:
./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -t 6
# => This is a test: jyut6 jyu5 ping3 jam1 zyun2 wun6
As can be seen, the text has now been converted into Jyutping romanization. Conversion into other systems is equally easy -- just replace 6 above with the index number of the system you wish to use for output.
To convert from a different source romanization system (e.g., to convert from Jyutping to Yale, or from S.L. Wong to Jyutping), provide the source system index number as a parameter using the -s (--source) option. The example below converts from Jyutping to Yale with diacritics:
./convert_pingyam.rb -i "This is a test: jyut6 jyu5 ping3 jam1 zyun2 wun6" -s 6 -t 1
# => This is a test: yuht yúh ping yām jyún wuhn
Checking input validity
Invalid romanization syllables can be identified using the -c (--check) option. This checks each word in the input string and outputs a list of words that are not recognizable as valid Cantonese syllables in the given romanization system:
./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -c
# => This
# => is
# => a
# => test:
The output in the above example contains words that are not valid syllables in Yale romanization (the default, since no other system was specified). To use a different romanization system, just provide the appropriate index number using the -s option. For example, the command below checks for invalid syllables in Jyutping:
./convert_pingyam.rb -i "This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6" -c -s 6
# => This
# => is
# => a
# => test:
# => Yut9
# => yu5
# => yam1
In the example above, the output contains not only the English words but also the Yale syllables Yut9, yu5, and yam1, because these are not valid syllables in Jyutping.
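The check amounts to splitting the input into words and keeping those that are not in the set of valid syllables for the chosen system. The sketch below reproduces the Jyutping example with a deliberately tiny syllable list. `invalid_syllables` and `VALID_JYUTPING` are hypothetical names, and case-insensitive matching is an assumption.

```ruby
require "set"

# Toy list of valid Jyutping syllables; the real check uses the full database.
VALID_JYUTPING = Set.new(%w[jyut6 jyu5 ping3 jam1 zyun2 jyun2 wun6]).freeze

# Hypothetical helper: return the words that are not valid syllables.
def invalid_syllables(line, valid)
  line.split.reject { |word| valid.include?(word.downcase) }
end

puts invalid_syllables("This is a test: Yut9 yu5 ping3 yam1 jyun2 wun6", VALID_JYUTPING)
# => This
# => is
# => a
# => test:
# => Yut9
# => yu5
# => yam1
```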
Modifying the output
The output transcription can be further modified using optional command-line flags, for example to convert regular tone numerals to superscript numerals (Unicode), or to revert to the traditional 6-tone Yale system.
- Superscript numerals: Several romanization systems use numerals to indicate tones in Cantonese. These are often represented in superscript form to increase readability of romanized text. To use superscript numerals, use the -S (--superscript) option with any numeral-using transcription system. For example, this will convert siu2 chak7 si3 to siu² chak⁷ si³.
- Yale normalization: To use the older 6-tone Yale transcription instead of the default 9-tone modified version, use the -Y (--yale) option. For example, this will convert yat7 jek8 kek9 to yat1 jek3 kek6.
These modifications can be combined -- the example below both normalizes the Yale transcription and converts the numerals to superscript:
./convert_pingyam.rb -i "yat7 jek8 kek9" -t 0 -YS
# => yat¹ jek³ kek⁶
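Both modifications are simple substitutions on the tone digit at the end of each syllable. The sketch below shows one way they could be written. `normalize_yale` and `superscript_tones` are hypothetical helpers, not the library's own methods, and they assume a tone digit always directly follows a letter.

```ruby
# Tones 7, 8 and 9 of the 9-tone scheme correspond to tones 1, 3 and 6.
NINE_TO_SIX = { "7" => "1", "8" => "3", "9" => "6" }.freeze

SUPERSCRIPT = {
  "1" => "¹", "2" => "²", "3" => "³", "4" => "⁴", "5" => "⁵",
  "6" => "⁶", "7" => "⁷", "8" => "⁸", "9" => "⁹"
}.freeze

# Hypothetical helper: rewrite 9-tone Yale numerals as 6-tone numerals.
def normalize_yale(text)
  # Only touch a digit that ends a word and directly follows a letter.
  text.gsub(/(?<=[[:alpha:]])[789]\b/, NINE_TO_SIX)
end

# Hypothetical helper: replace trailing tone numerals with superscripts.
def superscript_tones(text)
  text.gsub(/(?<=[[:alpha:]])[1-9]\b/, SUPERSCRIPT)
end

puts superscript_tones("siu2 chak7 si3")                 # => siu² chak⁷ si³
puts normalize_yale("yat7 jek8 kek9")                    # => yat1 jek3 kek6
puts superscript_tones(normalize_yale("yat7 jek8 kek9")) # => yat¹ jek³ kek⁶
```

Normalization has to run before the superscript step, since the second helper would otherwise turn 7, 8 and 9 into superscripts that the first no longer recognizes.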
Options
The following options can be provided to convert_pingyam.rb to control the conversion process:
- -c, --check: Check if input contains invalid Cantonese romanization
- -i, --input STRING: Input string to be converted
- -f, --filename FILE: Provide file for conversion
- -s, --source INDEX: Provide index number of romanization to convert from
- -S, --superscript: Print tone numerals as superscript
- -t, --target INDEX: Provide index number of romanization to convert into
- -Y, --yale: Normalize Yale to 6-tone traditional system
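For readers who want to build a similar command-line tool around the library, the option list above maps naturally onto Ruby's standard `OptionParser`. The sketch below is not the script's actual source. `parse_options` is a hypothetical helper, and the defaults (source 0, target 1) are inferred from the basic usage example rather than confirmed.

```ruby
require "optparse"

# Sketch only: mirrors the documented options, not the script's actual source.
def parse_options(argv)
  # Defaults inferred from the examples: Yale numerals in, Yale diacritics out.
  options = { source: 0, target: 1, check: false, superscript: false, yale: false }

  OptionParser.new do |opts|
    opts.on("-c", "--check", "Check input for invalid romanization") { options[:check] = true }
    opts.on("-i", "--input STRING", "Input string to be converted") { |v| options[:input] = v }
    opts.on("-f", "--filename FILE", "File to convert") { |v| options[:filename] = v }
    opts.on("-s", "--source INDEX", Integer, "Index to convert from") { |v| options[:source] = v }
    opts.on("-S", "--superscript", "Print tone numerals as superscript") { options[:superscript] = true }
    opts.on("-t", "--target INDEX", Integer, "Index to convert into") { |v| options[:target] = v }
    opts.on("-Y", "--yale", "Normalize Yale to 6 tones") { options[:yale] = true }
  end.parse(argv)

  options
end

p parse_options(["-i", "yat7 jek8 kek9", "-t", "0", "-Y", "-S"])
```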
To do
- Support for traditional 6-tone Yale (with numerals) (done)
- Conversion of tone numbers to superscript (done)
- Optional HTML output
- Handle files (done) and pipes as input
See also
- Pingyam database
- pingyam-js - Online Cantonese Romanization Converter
- pinyin-rb - Mandarin Chinese transcription conversion in Ruby