Number parser and formatter for Dicio assistant

This multilingual library implements methods to extract numbers, dates or durations from text and to format numbers, dates or durations into human-readable words. It is inspired by Mycroft's lingua-franca, from which it borrows some resource files. Currently only Italian (it-it) and English (en-us) are supported, and methods to extract dates or durations are not available yet (formatting them is).

This repository is part of the Dicio project. Also check out dicio-android, dicio-sentences-compiler and dicio-skill. Open to contributions :-D

Adding a language

You will need to translate some resource files, which contain words as well as regular expressions, and then adapt some Java code, so be prepared for that.

First of all you need to obtain the language-country pair for the language you want to add. This is important because users of the library select the language through this pair. For example, English is en-us and Italian is it-it. Let's call this LANGUAGE_COUNTRY from now on.

Resources

Copy the whole folder numbers/src/main/resources/config/en-us into a new folder numbers/src/main/resources/config/LANGUAGE_COUNTRY. All of the resource files in the new folder should be translated into the new language. DO NOT rename any file, just edit their contents.

*.word files

The files named like ENGLISH_WORD.word should contain only one line with the lowercase translation of the word ENGLISH_WORD in the new language. For example, the English second.word contains "second", the Italian one contains "secondo" and the new language one should contain the translation of "second" into that language. These files are also present in Mycroft's lingua-franca, so copy them from there to save time! Then just check if everything is fine.
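A quick sanity check for a translated *.word file could look like the following sketch. The class `WordFileCheck` and its method are hypothetical helpers written for this example and are not part of the library; the only rules assumed are the ones stated above (a single line, lowercase).

```java
import java.util.Locale;

// Hypothetical helper, not part of the library: checks the content of a *.word file.
final class WordFileCheck {
    // Returns true if the content is exactly one non-empty, lowercase line.
    // A trailing newline is tolerated.
    static boolean isValidWordFileContent(String content) {
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        if (trimmed.split("\\R").length != 1) {
            return false; // more than one line
        }
        return trimmed.equals(trimmed.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isValidWordFileContent("secondo\n")); // true
        System.out.println(isValidWordFileContent("Second"));    // false
    }
}
```

The content of a real file could be passed in with `Files.readString(path)` from `java.nio.file`.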

date_time.json file

This file contains the data needed to properly format dates, and in part also times, though that's also handled elsewhere. This file is also present in Mycroft's lingua-franca, so copy it from there to save time! Then just check if everything is fine.

Each of the formats provided in this file can (and should) contain references to other already-formatted strings to be substituted, in the form {FORMATTED_STRING_NAME}. For example, English uses "{formatted_date} at {formatted_time}" to format the date and the time together.
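To make the substitution mechanism concrete, here is a minimal sketch of how {FORMATTED_STRING_NAME} references could be resolved. `PlaceholderFormatter` is a hypothetical class written for this example, not the library's actual implementation, and replacing unknown names with an empty string is an assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the library's code: resolves {name} references in a format.
final class PlaceholderFormatter {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([A-Za-z0-9_]+)\\}");

    static String format(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder result = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // copy the literal text before the placeholder, then the substituted value
            result.append(template, last, matcher.start());
            // assumption: names without a value become an empty string
            result.append(values.getOrDefault(matcher.group(1), ""));
            last = matcher.end();
        }
        result.append(template, last, template.length());
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("{formatted_date} at {formatted_time}", Map.of(
                "formatted_date", "tuesday, may third, twenty twenty two",
                "formatted_time", "one twenty two p.m.")));
        // prints: tuesday, may third, twenty twenty two at one twenty two p.m.
    }
}
```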

"weekday", "date", "month", "number"

These JSON objects should contain:

"decade_format", "hundreds_format", "thousand_format"

These JSON objects should contain a numbered list (starting from 1, but it doesn't matter) of (regex, format) pairs, along with a "default" format for when none of the regexes match. Each (regex, format) pair should be a JSON object:

In these JSON objects you should put partial year formatting results that will finally be used by "year_format". It doesn't matter if, for some languages, instead of just formatting e.g. the thousands digit in "thousand_format", you also sometimes format the hundreds digit (this is what happens for English). The important thing is that, in the end, "year_format" produces correct results.

This is the table of possible FORMATTED_STRING_NAMEs you can use when working with "decade_format", "hundreds_format" or "thousand_format". Check out NiceYearSubstitutionTableBuilder.java. The examples are in English and relative to the years "2019" and "3865 b.c.".
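The "first matching regex wins, otherwise use the default" lookup described above can be sketched as follows. `FormatSelector` and `Rule` are hypothetical names, the rules in `main` are illustrative rather than copied from the en-us file, and testing the regexes against the unformatted number string is an assumption.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code: picks a format from (regex, format) pairs.
final class FormatSelector {
    static final class Rule {
        final Pattern regex;
        final String format;

        Rule(String regex, String format) {
            this.regex = Pattern.compile(regex);
            this.format = format;
        }
    }

    // Returns the format of the first rule whose regex matches, or the default format.
    static String select(List<Rule> rules, String defaultFormat, String number) {
        for (Rule rule : rules) {
            if (rule.regex.matcher(number).find()) {
                return rule.format;
            }
        }
        return defaultFormat;
    }

    public static void main(String[] args) {
        // illustrative rules only, in the order they would be numbered
        List<Rule> decadeRules = List.of(
                new Rule("^\\d$", "{x}"),
                new Rule("^1\\d$", "{xx}"),
                new Rule("^\\d0$", "{x0}"),
                new Rule("^[2-9]\\d$", "{x0} {x}"));
        System.out.println(select(decadeRules, "{number}", "19")); // {xx}
        System.out.println(select(decadeRules, "{number}", "65")); // {x0} {x}
    }
}
```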

<table> <tr> <th><code>FORMATTED_STRING_NAME</code></th> <th>Explanation</th> <th>e.g. "2019"</th> <th>e.g. "-3865"</th> </tr> <tr> <td><code>x</code></td> <td>the last digit of the year</td> <td>"nine"</td> <td>"five"</td> </tr> <tr> <td><code>xx</code></td> <td>the last two digits of the year, if there is a corresponding entry in the <code>"number"</code> list</td> <td>"nineteen"</td> <td>"" (65 is not special and thus has no entry in the <code>"number"</code> list)</td> </tr> <tr> <td><code>x0</code></td> <td>the tens of the year</td> <td>"ten"</td> <td>"sixty"</td> </tr> <tr> <td><code>x_in_x0</code></td> <td>the tens digit of the year</td> <td>"one"</td> <td>"six"</td> </tr> <tr> <td><code>xxx</code></td> <td>the last three digits of the year, if there is a corresponding entry in the <code>"number"</code> list</td> <td>"nineteen" (there is no hundreds digit)</td> <td>"" (865 is not special and thus has no entry in the <code>"number"</code> list)</td> </tr> <tr> <td><code>x00</code></td> <td>the hundreds of the year, if there is a corresponding entry in the <code>"number"</code> list</td> <td>"zero" (probably doesn't make much sense)</td> <td>"" (English pronounces hundreds by putting "hundred" after the unit, so there is no entry for 800 in the <code>"number"</code> list)</td> </tr> <tr> <td><code>x_in_x00</code> and <code>x_in_0x00</code></td> <td>the hundreds digit of the year</td> <td>"zero"</td> <td>"eight"</td> </tr> <tr> <td><code>xx00</code></td> <td>the year with its tens and its units digits set to 0, if there is a corresponding entry in the <code>"number"</code> list</td> <td>"" (2000 is not special and thus has no entry in the <code>"number"</code> list)</td> <td>"" (3800 is not special and thus has no entry in the <code>"number"</code> list)</td> </tr> <tr> <td><code>xx_in_xx00</code></td> <td>the year divided by 100, if there is a corresponding entry in the <code>"number"</code> list</td> <td>"twenty"</td> <td>"" (38 is not special and thus has no 
entry in the <code>"number"</code> list)</td> </tr> <tr> <td><code>x000</code></td> <td>the thousands of the year, if there is a corresponding entry in the <code>"number"</code> list</td> <td>"" (English pronounces thousands by putting "thousand" after the unit, so there is no entry for 2000 in the <code>"number"</code> list)</td> <td>"" (same reason, for 3000)</td> </tr> <tr> <td><code>x_in_x000</code></td> <td>the thousands digit of the year</td> <td>"two"</td> <td>"three"</td> </tr> <tr> <td><code>x0_in_x000</code></td> <td>the thousands digit of the year, multiplied by 10</td> <td>"twenty"</td> <td>"thirty"</td> </tr> <tr> <td><code>number</code></td> <td>the non-formatted part of the year corresponding to <code>"decade_format"</code> (only last two digits), <code>"hundreds_format"</code> (last three digits) <b>or</b> <code>"thousand_format"</code> (last four digits), to be used for <code>"default"</code> as a fallback</td> <td>"19", "19" or "2019"</td> <td>"65", "865" or "3865"</td> </tr> </table>
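The numeric value behind each name in the table can be worked out with plain integer arithmetic, as in the sketch below. `YearParts` is a hypothetical class written for this example; NiceYearSubstitutionTableBuilder.java remains the reference. The sketch stops at the numbers and does not look them up in the "number" list, so it does not produce words or the empty strings shown in the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the library's code: numeric value behind each table entry.
final class YearParts {
    static Map<String, Integer> decompose(int year) {
        int y = Math.abs(year); // the sign is handled separately (see "bc")
        Map<String, Integer> parts = new LinkedHashMap<>();
        parts.put("x", y % 10);                       // last digit
        parts.put("xx", y % 100);                     // last two digits
        parts.put("x0", y % 100 / 10 * 10);           // tens
        parts.put("x_in_x0", y % 100 / 10);           // tens digit
        parts.put("xxx", y % 1000);                   // last three digits
        parts.put("x00", y % 1000 / 100 * 100);       // hundreds
        parts.put("x_in_x00", y % 1000 / 100);        // hundreds digit
        parts.put("xx00", y / 100 * 100);             // tens and units set to 0
        parts.put("xx_in_xx00", y / 100);             // year divided by 100
        parts.put("x000", y % 10000 / 1000 * 1000);   // thousands
        parts.put("x_in_x000", y % 10000 / 1000);     // thousands digit
        parts.put("x0_in_x000", y % 10000 / 1000 * 10); // thousands digit times 10
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(decompose(2019).get("xx"));          // 19
        System.out.println(decompose(-3865).get("x0"));         // 60
        System.out.println(decompose(-3865).get("xx_in_xx00")); // 38
    }
}
```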

"year_format"

This JSON object follows the same structure as "decade_format", "hundreds_format", "thousand_format", but there is also a "bc" field that should contain the translation of the shortened "Before Christ" ("b.c."). In this JSON object you should put how to fully format a number as a year, using the formatted strings already calculated using "decade_format", "hundreds_format" and "thousand_format".

The formats have at their disposal the full table from above plus the following items (which are the ones that should actually be used).

<table> <tr> <th><code>FORMATTED_STRING_NAME</code></th> <th>Explanation</th> <th>e.g. "2019"</th> <th>e.g. "-3865"</th> </tr> <tr> <td><code>formatted_decade</code></td> <td>the decade formatted using <code>"decade_format"</code></td> <td>"nineteen"</td> <td>"sixty five"</td> </tr> <tr> <td><code>formatted_hundreds</code></td> <td>the hundreds formatted using <code>"hundreds_format"</code></td> <td>"zero hundred" (yeah, it doesn't make sense)</td> <td>"eight hundred"</td> </tr> <tr> <td><code>formatted_thousand</code></td> <td>the thousands formatted using <code>"thousand_format"</code></td> <td>"twenty"</td> <td>"thirty eight"</td> </tr> <tr> <td><code>bc</code></td> <td>the translation of "b.c." if the year is before Christ, otherwise an empty string</td> <td>""</td> <td>"b.c."</td> </tr> <tr> <td><code>number</code></td> <td>the non-formatted full-year, to be used for <code>"default"</code> as a fallback</td> <td>"2019"</td> <td>"3865"</td> </tr> </table>

"date_format"

This JSON object should contain a format in these fields: "date_full", "date_full_no_year", "date_full_no_year_month"; and a translation of the field name in these fields: "today", "tomorrow", "yesterday".

The formats have at their disposal this limited table.

<table> <tr> <th><code>FORMATTED_STRING_NAME</code></th> <th>Explanation</th> <th>e.g. "Tuesday, 2022/05/03"</th> </tr> <tr> <td><code>day</code></td> <td>the name of the day in the month</td> <td>"third"</td> </tr> <tr> <td><code>weekday</code></td> <td>the name of the day in the week</td> <td>"tuesday"</td> </tr> <tr> <td><code>month</code></td> <td>the name of the month</td> <td>"may"</td> </tr> <tr> <td><code>formatted_year</code></td> <td>the year fully formatted using <code>"year_format"</code></td> <td>"twenty twenty two"</td> </tr> </table>

"date_time_format"

This JSON object should contain a format in its only field: "date_time".

The format has at its disposal this limited table.

<table> <tr> <th><code>FORMATTED_STRING_NAME</code></th> <th>Explanation</th> <th>e.g. "Tuesday, 2022/05/03 13:22"</th> </tr> <tr> <td><code>formatted_date</code></td> <td>the date fully formatted using <code>"date_format"</code></td> <td>"tuesday, may third, twenty twenty two"</td> </tr> <tr> <td><code>formatted_time</code></td> <td>the time formatted using the Java method <code>niceTime</code></td> <td>"one twenty two p.m."</td> </tr> </table>

tokenizer.json

This JSON file contains the information the tokenizer uses to generate the token stream corresponding to an input string.

Test resources

Copy the whole folder numbers/src/test/resources/config/en-us into a new folder numbers/src/test/resources/config/LANGUAGE_COUNTRY. All of the resource files in the new folder are used for testing purposes and should be translated into the new language. DO NOT rename any file, just edit their contents.

date_time_test.json

This file contains some tests for the date_time.json file. This file is also present in Mycroft's lingua-franca, so copy it from there to save time! Then just check if everything is fine.

Each of the JSON objects described below contains a numbered list of tests to run.

"test_nice_year"

These tests are for "year_format". Each test has:

"test_nice_date"

These tests are for "date_format". Each test has:

"test_nice_date_time"

These tests are for "date_time_format". Each test has: