uroman

uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.<br>     Example (Greek): Νεπάλ → Nepal<br>     Example (Hindi):  नेपाल → nepaal<br>     Example (Urdu):  نیپال → nypal<br>     Example (Chinese): 三万一 → 31000

New Python version: 1.3.1.1 (released on June 27, 2024)<br> Last Perl version: 1.2.8 (released on April 23, 2021)<br> Author: Ulf Hermjakob, USC Information Sciences Institute
Quick links (inside this doc): uroman CLI, import uroman, Old Perl version, change history, reversibility, limitations

(New) Python version

Installation

python3 -m pip install uroman

<a name="cli"></a>

Command Line Interface (CLI)

Examples

python3 -m uroman "Игорь Стравинский"
python3 -m uroman Игорь -l ukr
python3 -m uroman Ντέιβις Καπ -l ell
python3 -m uroman "\u03C0\u03B9" -d
python3 -m uroman -l hin -i mini-test/hin.txt
python3 -m uroman -l fas -i mini-test/fas.txt -o mini-test/fas-rom.jsonl -f edges
python3 -m uroman < mini-test/multi-script.txt > mini-test/multi-script.uroman.txt
python3 -m uroman -h

<b>Note:</b> Using the uroman CLI for single strings can be useful for simple tests, but it is inefficient at scale because the data resources are reloaded on every invocation. It is more efficient to romanize entire files or to use uroman inside Python as shown further below.<br> <b>Note:</b> The mini-test directory is included in this release. Run   <code>python3 -m uroman x --verbose</code>   to find its location. You can compare your output mini-test/multi-script.uroman.txt with the reference output mini-test/multi-script.uroman-ref.txt.
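The comparison against the reference output mentioned in the note above could be scripted roughly as follows. This is a minimal sketch using only the Python standard library; the helper name `diff_against_reference` is hypothetical and not part of uroman, and the paths are the ones from the note (adjust them to wherever mini-test lives on your system).

```python
import difflib
from pathlib import Path


def diff_against_reference(output_path, reference_path):
    """Return unified-diff lines between two text files (empty list if identical)."""
    output_lines = Path(output_path).read_text(encoding="utf-8").splitlines()
    reference_lines = Path(reference_path).read_text(encoding="utf-8").splitlines()
    return list(difflib.unified_diff(reference_lines, output_lines,
                                     fromfile=str(reference_path),
                                     tofile=str(output_path),
                                     lineterm=""))


if __name__ == "__main__":
    # Paths taken from the note above; they are assumed to be relative to the current directory.
    out = Path("mini-test/multi-script.uroman.txt")
    ref = Path("mini-test/multi-script.uroman-ref.txt")
    if out.exists() and ref.exists():
        differences = diff_against_reference(out, ref)
        print(f"{len(differences)} diff lines" if differences else "Output matches the reference.")
    else:
        print("mini-test files not found; run the CLI example first.")
```

Lines prefixed with `+` in the returned diff appear only in your output; lines prefixed with `-` appear only in the reference.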

uroman.py   Argument Structure Highlights

<table> <tr><td><i>Direct inputs (zero&nbsp;or&nbsp;more)</i></td><td>such as ‘Игорь Стравинский’ and ‘Ντέιβις’ above.</td></tr> <tr><td>-l<br>--lcode</td><td>language code according to <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes" target="_LCODE">ISO-639-3</a>, e.g. <i>-l ukr</i> for Ukrainian, <i>-l hin</i> for Hindi, <i>-l fas</i> for Persian</td></tr> <tr><td>-i<br>--input_filename</td><td>alternative:&nbsp;<i>stdin</i><br>Note: If both <i>direct inputs</i> and <i>input_filename</i> are given, the romanization results for <i>direct inputs</i> will be written to <i>stderr</i>.</td></tr> <tr><td width="200">-o<br><nobr>--output_filename</nobr></td><td>alternative: <i>stdout</i></td></tr> <tr><td>-f<br>--rom_format</td><td>Output format choices: <ul> <li> -f str &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (best string, default, output format: string) <li> -f edges (best edges, includes offset information, output format: JSONL) <li> -f alts &nbsp;&nbsp;&nbsp;&nbsp; (lattice including alternative edges, output format: JSONL) <li> -f lattice (lattice including alternative and superseded edges, output format: JSONL) </ul></td></tr> <tr><td>-d<br>--decode_unicode</td><td>Decode Unicode escape sequences such as ‘\u03C0\u03B9’ to ‘πι’ which in turn will be romanized to ‘pi’. This is useful for input formats such as JSON.</td></tr> <tr><td>-h<br>--help</td><td>Use this option to see the full argument structure with all options.</td></tr> </table>
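To illustrate what the <i>-d</i> option describes, here is a small Python sketch of decoding Unicode escape sequences before romanization. It is not uroman's actual implementation; the function name `decode_unicode_escapes` is hypothetical, and the sketch only handles escapes of the form backslash, `u`, and four hexadecimal digits (it does not combine surrogate pairs).

```python
import re

# Matches a backslash, the letter 'u', and exactly four hexadecimal digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_unicode_escapes(text):
    """Replace each four-hex-digit Unicode escape in text with the character it denotes."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


# The escaped input from the table above decodes to the two Greek letters pi and iota.
print(decode_unicode_escapes(r"\u03C0\u03B9") == "πι")  # True
```

Text without escape sequences passes through unchanged, so such a step can be applied to mixed input.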

<a name="package"></a>

Using uroman inside Python

Examples

import uroman as ur

uroman = ur.Uroman()   # load uroman data (takes about a second or so)
print(uroman.romanize_string('Игорь Стравинский'))
print(uroman.romanize_string('Игорь', lcode='ukr'))
uroman.romanize_file(input_filename='mini-test/multi-script.txt',
                     output_filename='mini-test/multi-script.uroman.jsonl',
                     rom_format=ur.RomFormat.LATTICE)

Methods

uroman = ur.Uroman(data_dir)

This constructor loads the data needed for the romanization of different languages. The call might take about a second (real time) to load all of the romanization data, but it is needed only once; the resulting object can be reused for any number of subsequent romanization calls.

<table> <tr><td>data_dir</td><td>data directory (optional, default: standard uroman data directory)</td></tr> </table> <hr>
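Because the constructor `ur.Uroman()` loads its data only once, a single object can serve many `romanize_string` calls. The sketch below shows that pattern; the helper `romanize_many` is hypothetical (not part of uroman) and works with any object that offers a `romanize_string(s, lcode=...)` method as documented here.

```python
def romanize_many(romanizer, texts, lcode=None):
    """Romanize each string in texts with one already-loaded romanizer object."""
    # Pass lcode only when one was given, so the romanizer's own default applies otherwise.
    kwargs = {"lcode": lcode} if lcode else {}
    return [romanizer.romanize_string(text, **kwargs) for text in texts]


if __name__ == "__main__":
    try:
        import uroman as ur
    except ImportError:
        print("uroman is not installed; run: python3 -m pip install uroman")
    else:
        uroman = ur.Uroman()  # data is loaded once here
        for romanized in romanize_many(uroman, ["Игорь", "Стравинский"]):
            print(romanized)
```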

uroman.romanize_string(s, lcode, rom_format)

This method takes a string <i>s</i> and returns its romanization in the format according to <i>rom_format</i>: a string (default), or a list of edges.

<table> <tr><td>s</td><td>string to be romanized, e.g. "ایران"</td></tr> <tr><td>lcode</td><td>language code, optional, a 3-letter code such as 'eng' for English (ISO-639-3)</td></tr> <tr><td>rom_format</td><td>Output format choices: <ul> <li> ur.RomFormat.STR &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(best string, default, output format: string) <li> ur.RomFormat.EDGES &nbsp;(best edges, includes offset information, output format: JSONL) <li> ur.RomFormat.ALTS &nbsp;&nbsp;&nbsp;&nbsp;(lattice including alternative edges, output format: JSONL) <li> ur.RomFormat.LATTICE (lattice including alternative and superseded edges, output format: JSONL) </ul></td></tr> </table> <hr>
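The four `ur.RomFormat` members for `romanize_string` correspond by name to the CLI choices `-f str`, `-f edges`, `-f alts` and `-f lattice`. A script that accepts such a name from a user could map it as sketched below. The helper `rom_format_from_name` is hypothetical and not part of uroman; it only assumes that the enum has members named STR, EDGES, ALTS and LATTICE, as listed in the table.

```python
def rom_format_from_name(rom_format_enum, name):
    """Return the member of rom_format_enum that matches a CLI-style format name."""
    valid_names = ("str", "edges", "alts", "lattice")  # the four CLI -f choices
    if name not in valid_names:
        raise ValueError(f"unknown format {name!r}; expected one of {valid_names}")
    return getattr(rom_format_enum, name.upper())


if __name__ == "__main__":
    try:
        import uroman as ur
    except ImportError:
        print("uroman is not installed; run: python3 -m pip install uroman")
    else:
        uroman = ur.Uroman()
        edges = uroman.romanize_string("Игорь", lcode="ukr",
                                       rom_format=rom_format_from_name(ur.RomFormat, "edges"))
        print(f"{len(edges)} edges")
```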

uroman.romanize_file(input_filename, output_filename, lcode)

This method romanizes a file <i>input_filename</i> to <i>output_filename</i>.

<table> <tr><td>input_filename</td><td>default: stdin&nbsp;(for input_filename value of <i>None</i>)</td></tr> <tr><td width="200">output_filename</td><td>default: stdout&nbsp;(for output_filename value of <i>None</i>)</td></tr> <tr><td>lcode</td><td>language code (optional), a 3-letter code such as 'eng' for English (ISO-639-3)</td></tr> </table>
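When `romanize_file` is used with one of the JSONL output formats, the resulting file can be read back one record per line. The following sketch uses only the standard library; `read_jsonl` is a hypothetical helper, and because the layout of each record is not specified in this document, the code treats every record as an opaque JSON value.

```python
import json
import os


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Output file name taken from the romanize_file example earlier in this document.
    path = "mini-test/multi-script.uroman.jsonl"
    if os.path.exists(path):
        records = list(read_jsonl(path))
        print(f"read {len(records)} records")
    else:
        print("JSONL file not found; run the romanize_file example first.")
```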

<a name="old_perl_version"></a>

Old Perl Version

<sup>The old Perl version is included on GitHub, but not on PyPI.</sup>

Usage

$ uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN
       where the optional <lang-code> is a 3-letter language code, e.g. ara, bel, bul, deu, ell, eng, fas,
            grc, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid.
       --chart specifies chart output (in JSON format) to represent alternative romanizations.
       --no-cache disables caching.

Examples

<sup>Note: Directories text and test are under uroman's root directory on GitHub.</sup>

uroman.pl < text/zho.txt
uroman.pl -l tur < text/tur.txt
uroman.pl -l heb --chart < text/heb.txt
uroman.pl < test/multi-script.txt > test/multi-script.uroman-perl.txt

Identifying the input as Arabic, Belarusian, Bulgarian, English, German, Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian, Lithuanian, Macedonian, Ossetian, Persian, Russian, Serbian, Turkish, Ukrainian, Uyghur or Yiddish will improve romanization for those languages, because some letters in those languages have different sound values from other languages using the same script (Arabic vs. Persian, Russian vs. Ukrainian, Hebrew vs. Yiddish). In this version, specifying a language code has no effect for other languages.

Bibliography

Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Demo Track. ACL-2018 Best Demo Paper Award.

<a name="change_history"></a>

Change History

Changes in version 1.3.1

Changes in version 1.2.8

Changes in version 1.2.7

Changes in version 1.2.6

Changes in version 1.2.5

Changes in version 1.2.4

Changes in version 1.2

Changes in version 1.1 (major upgrade)

Changes in version 1.0 (major upgrade)

Changes in version 0.7 (minor upgrade)

Changes in version 0.6 (minor upgrade)

Changes in version 0.5 (minor upgrade)

Changes in version 0.4 (minor upgrade)

New features in version 0.3

Other features

<a name="reversibility"></a>

Reversibility

<a name="limitations"></a>

Limitations

Acknowledgments

Earlier versions of this tool were based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116, and by research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, Air Force Laboratory, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.