nalgene
A natural language generation language, intended for creating training data for intent parsing systems.
Overview
Nalgene generates pairs of sentences and grammar trees by a random (or guided) walk through a grammar file.
- Sentence: the natural language sentence, e.g. "turn on the light"
- Tree: a nested list of tokens (an s-expression) generated alongside the sentence, e.g.
  ( %setDeviceState ( $device.name light ) ( $device.state on ) )
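Code that consumes the generated data usually needs the tree as a data structure rather than a string. The following is a minimal sketch, not part of nalgene itself, of reading the parenthesised tree format into nested Python lists. `parse_tree` is a hypothetical helper name, and the sketch assumes every parenthesis is separated by whitespace, as in the examples in this README.

```python
def parse_tree(text):
    """Parse a parenthesised tree string into nested Python lists."""
    tokens = text.split()  # assumes "(" and ")" are whitespace-separated
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1  # consume "("
        node = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())  # nested sub-tree
            else:
                node.append(tokens[pos])   # node name or captured word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ")"
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


tree = parse_tree("( %setDeviceState ( $device.name light ) ( $device.state on ) )")
print(tree)
```

Each list starts with the node name; for a `$value` node the remaining items are the captured words.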
Usage
$ python generate.py [template.nlg] [entry] [--key=value] ...
By default, generation walks through the template tree from the entry % node and chooses phrases and values randomly:
$ python generate.py examples/iot.nlg
> if the temperature in minnesota is equal to 2 then please turn the office light off thanks
( %if
    ( %condition
        ( %currentWeather
            ( $location minnesota ) )
        ( $operator equal to )
        ( $number 2 ) )
    ( %setDeviceState
        ( $device.name office light )
        ( $device.state off ) ) )
You can choose an entry point to start generation from:
$ python generate.py examples/iot.nlg getWeather
> tell me what it's like in new york
( %getWeather
    ( $location new york ) )
You can also supply values from the command line (unspecified values will be randomly chosen):
$ python generate.py examples/iot.nlg getWeather --location tokyo
> what is the weather in tokyo ?
( %getWeather
    ( $location tokyo ) )
Or from a JSON file:
$ cat command.json
{"entry": "%setDeviceState", "values": {"$device.state": "off", "$device.name": "office light"}}
$ cat command.json | python generate.py examples/iot.nlg
> please turn off the office light
( %setDeviceState
    ( $device.state off )
    ( $device.name office light ) )
Syntax
A .nlg nalgene grammar file is a set of sections separated by a blank line. Every section takes this shape:
node_name
    token sequence 1
    token sequence 2
The indented lines under a node are the node's possible token sequences. Each token in a sequence is either
- a regular word (no prefix),
- a %phrase node,
- a $value node,
- a @ref node,
- or a ~synonym word.
Each token is added to the output sentence and/or tree during generation, depending on the type.
A standard .nlg file starts with a start phrase %, which is the default entry point for the generator. The generator may also use a specific entry point.
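To make the file layout concrete, here is a hedged sketch of how such a file could be read into a dictionary. It is not nalgene's actual parser: `parse_grammar` is a hypothetical name, and the sketch assumes that node names are unindented, that any leading whitespace marks a token sequence, and that tokens are separated by spaces.

```python
def parse_grammar(text):
    """Map each node name to its list of token sequences (lists of tokens)."""
    grammar = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None                      # a blank line ends the section
        elif line[0].isspace():
            if current is None:
                raise ValueError("token sequence outside a section: %r" % line)
            grammar[current].append(line.split())
        else:
            current = line.strip()              # unindented line: a node name
            grammar[current] = []
    return grammar


GRAMMAR = """\
%
    %greeting
    %farewell
    %greeting and %farewell

%greeting
    hey there
    hi

%farewell
    goodbye
    bye
"""

grammar = parse_grammar(GRAMMAR)
print(grammar["%"])
```

The default entry point is then simply the `"%"` key of the resulting dictionary.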
Phrases
A phrase (%phrase) is a general set of token sequences. A phrase is potentially recursive, using tokens which represent other phrases (even itself). Each phrase defines one or more possible sequences.
The regular words in a phrase are ignored in the output tree. This makes them useful for defining higher-level grammar for the same intent, for example different word orders ("turn on the light" vs "turn the light on").
Using this grammar:
%
    %greeting
    %farewell
    %greeting and %farewell

%greeting
    hey there
    hi

%farewell
    goodbye
    bye
The generator might output:
> hey there and bye
( %
    ( %greeting )
    ( %farewell ) )
Basic generation walkthrough
Here's how the generator arrived at this specific sentence and tree pair:
- Start at start node %, with an empty output sentence "" and tree ( % )
- Randomly choose a token sequence, in this case the 3rd: %greeting and %farewell
- The first token is a phrase token %greeting, so
    - Add a new sub-tree ( %greeting ) to the parent tree
    - Look up the token sequences for %greeting
    - Choose one, in this case hey there
    - For both of these regular word tokens, add to the output sentence (but not to the tree)
- At this point the output sentence is "hey there" and the parse tree is ( % ( %greeting ) )
- The second token is a regular word "and", so add it to the output sentence
- The third token is another phrase %farewell, so
    - Add a new sub-tree ( %farewell ) to the parent tree
    - Look up the token sequences for %farewell
    - Choose one, in this case bye
    - Add to the output sentence
- Now the output sentence is "hey there and bye"
- No more tokens, so we're done
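The steps above can be sketched in a few lines of Python. This is an illustrative reimplementation for phrases and regular words only, not the code in generate.py; `walk`, `show` and the `choose` parameter are names invented for this example, and the grammar is written directly as a dictionary.

```python
import random

GRAMMAR = {
    "%": [["%greeting"], ["%farewell"], ["%greeting", "and", "%farewell"]],
    "%greeting": [["hey", "there"], ["hi"]],
    "%farewell": [["goodbye"], ["bye"]],
}


def walk(grammar, node, choose=random.choice):
    """Return (words, tree) for one expansion of a phrase node."""
    words, tree = [], [node]
    for token in choose(grammar[node]):      # pick one token sequence
        if token.startswith("%"):
            # Phrase token: expand it and attach its sub-tree to the parent.
            sub_words, sub_tree = walk(grammar, token, choose)
            words.extend(sub_words)
            tree.append(sub_tree)
        else:
            # Regular word: goes to the sentence only, never to the tree.
            words.append(token)
    return words, tree


def show(tree):
    """Format a nested-list tree in the parenthesised style used above."""
    inner = " ".join(show(t) if isinstance(t, list) else t for t in tree)
    return "( " + inner + " )"


words, tree = walk(GRAMMAR, "%")
print(">", " ".join(words))
print(show(tree))
```

Passing a custom `choose` function instead of `random.choice` makes the walk reproducible, which is one way to think about a "guided" walk.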
Values
Sometimes you need to capture the specific words in a sentence, for example to capture the location in a sentence like "how is the weather in boston". Values, marked with a dollar sign as $value, are a type of leaf node that captures its regular word tokens in the tree.
%getWeather
    what is the weather in $location
    how is the $location weather

$location
    boston
    san francisco
    tokyo
> what is the weather in san francisco
( %getWeather
    ( $location san francisco ) )
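As a rough illustration of how a value leaf differs from a phrase, the sketch below expands a `$value` token into words that go to both the sentence and a leaf in the tree. It is an assumption-laden example, not nalgene's implementation: `walk`, `values` and `choose` are hypothetical names, and the `values` dictionary only mimics the idea of supplying a value instead of choosing one at random.

```python
import random

GRAMMAR = {
    "%getWeather": [
        ["what", "is", "the", "weather", "in", "$location"],
        ["how", "is", "the", "$location", "weather"],
    ],
    "$location": [["boston"], ["san", "francisco"], ["tokyo"]],
}


def walk(grammar, node, values=None, choose=random.choice):
    """Expand a phrase; each $value adds a leaf holding the words it produced."""
    values = values or {}
    words, tree = [], [node]
    for token in choose(grammar[node]):
        if token.startswith("$"):
            if token in values:
                captured = values[token].split()         # supplied by the caller
            else:
                captured = list(choose(grammar[token]))  # chosen at random
            words.extend(captured)                       # words appear in the sentence...
            tree.append([token] + captured)              # ...and are captured in the tree
        else:
            words.append(token)                          # regular word: sentence only
    return words, tree


words, tree = walk(GRAMMAR, "%getWeather", values={"$location": "tokyo"})
print(">", " ".join(words))
print(tree)
```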
Refs
TODO: Better name for this
As an alternative to the freeform $value, there is a @ref leaf node which references a specific value without capturing the words beneath it. This allows you to reference a specific entity, e.g. a specific room or device name, with multiple expansions.
%turnOnLight
    turn the %light on

%light
    @office_light
    @living_room_light

@office_light
    office light
    light in the office

@living_room_light
    light in the den
    light in the living room
    living room light
Synonyms
Synonyms, marked ~synonym, are output only on the sentence side, and are useful for supplying word variations.
%good
    ~exclamation this is ~so ~good

~exclamation
    wow
    omg

~so
    so
    very
    extremely

~good
    good
    great
    wonderful
> wow this is extremely great
( %good )
Optional tokens
Tokens with a ? at the end will be used only 50% of the time.
%findFood
    ~find $price? $food ~near $location
> find me sushi in san francisco
( %
    ( %findFood
        ( $food sushi )
        ( $location san francisco ) ) )
> tell me the cheap fried chicken around tokyo
( %
    ( %findFood
        ( $price cheap )
        ( $food fried chicken )
        ( $location tokyo ) ) )
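One way to picture the optional marker is as a filtering pass over a chosen token sequence before it is expanded. The sketch below is an assumption about how this could work, not a description of generate.py: `resolve_optional` and its `rng` parameter are hypothetical names.

```python
import random


def resolve_optional(sequence, rng=random):
    """Keep each token ending in '?' with probability 0.5, stripping the marker."""
    kept = []
    for token in sequence:
        if token.endswith("?"):
            if rng.random() < 0.5:
                kept.append(token[:-1])  # keep the token, drop the "?"
            # otherwise the optional token is skipped entirely
        else:
            kept.append(token)           # non-optional tokens are always kept
    return kept


print(resolve_optional(["~find", "$price?", "$food", "~near", "$location"]))
```

When the token is skipped it contributes nothing to the sentence or the tree, which matches the first example above, where no `$price` leaf appears.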
Passthrough tokens
Tokens with a = at the end are called "passthrough" tokens and will not be included in the output tree, but their children will be. The = marker goes on the node's definition at the root level (the unindented name line), not on its uses within a token sequence.
%
    ~please? %command

%command=
    %getTime
    %getFact

%getTime
    what time is it
    what is the time

%getFact=
    %getLocationFact
    %getPersonFact
    %getPersonalFact
In this case, whenever the %command token is encountered, whatever its children output will be directly added to the tree (as opposed to prefixed with the %command token), so it will be output as %getTime or %getFact. In fact, %getFact is itself a passthrough token, so the output of its children is passed all the way up the tree.
> what is the time
( %
    ( %getTime ) )
> pretty please what is the population of tokyo
( %
    ( %getLocationFact
        ( $location_fact population )
        ( $location tokyo ) ) )
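The splicing behaviour can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual generator: `walk` and `choose` are hypothetical names, the grammar is cut down to a single command, and a passthrough node is recognised by looking for its name with a trailing `=` among the dictionary keys.

```python
import random

GRAMMAR = {
    "%": [["%command"]],
    "%command=": [["%getTime"]],   # passthrough: defined with a trailing "="
    "%getTime": [["what", "time", "is", "it"], ["what", "is", "the", "time"]],
}


def walk(grammar, node, choose=random.choice):
    """Return (words, subtrees) for a phrase node.

    subtrees is a list: a normal node returns one tree wrapping its
    children, while a passthrough node returns its children directly
    so that they are spliced into the parent.
    """
    passthrough = (node + "=") in grammar
    key = node + "=" if passthrough else node
    words, children = [], []
    for token in choose(grammar[key]):
        if token.startswith("%"):
            sub_words, sub_trees = walk(grammar, token, choose)
            words.extend(sub_words)
            children.extend(sub_trees)
        else:
            words.append(token)
    subtrees = children if passthrough else [[node] + children]
    return words, subtrees


words, trees = walk(GRAMMAR, "%")
print(">", " ".join(words))
print(trees[0])
```

Because the children are returned as a list, nested passthrough nodes such as `%getFact=` would pass their children up through every passthrough level without extra handling.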