Home

Awesome

Gruut

A tokenizer, text cleaner, and IPA phonemizer for several human languages that supports SSML.

from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)

which outputs:

He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖

Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.

A subset of SSML is also supported:

from gruut import sentences

ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
    xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)

with the output:

0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖

See the documentation for more details.

Installation

pip install gruut

Languages besides English can be added during installation. For example, with French and Italian support:

pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]

The extra pip repo is needed for an updated num2words fork that includes support for more languages.

You may also manually download language files and use put them in $XDG_CONFIG_HOME/gruut/ ($HOME/.config/gruut by default).

gruut will look for language files in the directory $XDG_CONFIG_HOME/gruut/<lang>/ if the corresponding Python package is not installed. Note that <lang> here is the full language name, e.g. de-de instead of just de.

Supported Languages

gruut currently supports:

The goal is to support all of voice2json's languages

Dependencies

Numbers, Dates, and More

gruut can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., <s lang="...">).

The following types of expressions can be automatically expanded into words by gruut:

Command-Line Usage

The gruut module can be executed with python3 -m gruut --language <LANGUAGE> <TEXT> or with the gruut command (from setup.py).

The gruut command is line-oriented, consuming text and producing JSONL. You will probably want to install jq to manipulate the JSONL output from gruut.

Plain Text

Takes raw text and outputs JSONL with cleaned words/tokens.

echo 'This, right here, is some "RAW" text!' \
   | gruut --language en-us \
   | jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!

More information is available in the full JSON output:

gruut --language en-us 'More  text.' | jq .

Output:

{
  "idx": 0,
  "text": "More text.",
  "text_with_ws": "More text.",
  "text_spoken": "More text",
  "par_idx": 0,
  "lang": "en-us",
  "voice": "",
  "words": [
    {
      "idx": 0,
      "text": "More",
      "text_with_ws": "More ",
      "leading_ws": "",
      "training_ws": " ",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "JJR",
      "phonemes": [
        "m",
        "ˈɔ",
        "ɹ"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 1,
      "text": "text",
      "text_with_ws": "text",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "NN",
      "phonemes": [
        "t",
        "ˈɛ",
        "k",
        "s",
        "t"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 2,
      "text": ".",
      "text_with_ws": ".",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": null,
      "phonemes": [
        "‖"
      ],
      "is_major_break": true,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": true,
      "is_spoken": false,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }
  ],
  "pause_before_ms": 0,
  "pause_after_ms": 0
}

For the whole input line and each word, the text property contains the processed input text with normalized whitespace while text_with_ws retains the original whitespace. The text_spoken property only contains words that are spoken, so punctuation and breaks are excluded.

Within each word, there is:

See python3 -m gruut <LANGUAGE> --help for more options.

SSML

A subset of SSML is supported:

Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as gruut:<TAG>. For initialisms and spell-out, the role gruut:letter is used to indicate that e.g., "a" should be spoken as /eɪ/ instead of /ə/.

For en-us, the following additional roles are available from the part-of-speech tagger:

Inline Lexicons

Inline pronunciation lexicons are supported via the <lexicon> and <lookup> tags. gruut diverges slightly from the SSML standard here by allowing lexicons to be defined within the SSML document itself (url is blank or missing). Additionally, the id attribute of the <lexicon> element can be left off to indicate a "default" inline lexicon that does not require a corresponding <lookup> tag.

For example, the following document will yield three different pronunciations for the word "tomato":

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>

The first "tomato" will be looked up in the U.S. English lexicon (/t ə m ˈeɪ t oʊ/). Within the <lookup> tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a role attached (selecting a made up pronunciation in this case).

Even further from the SSML standard, gruut allows you to leave off the <lexicon> id entirely. With no id, a <lookup> tag is no longer needed, allowing you to override the pronunciation of any word in the document:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>

This will yield a pronunciation of /t ə m ˈɑ t oʊ/ for all instances of "tomato" in the document (unless they have a <lookup>).

Intended Audience

gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.

For each supported language, gruut includes a:

Some languages also include: