Awesome

`t-regex`: matchers and grammars using tree regular expressions

t-regex defines a series of combinators to express tree regular expressions over any Haskell data type. In addition, with the use of some combinators (and a bit of Template Haskell), it defines nice syntax for using this tree regular expressions for matching and computing attributes over a term.

Defining your data type

In order to use t-regex, you need to define you data type in a way amenable to inspection. In particular, it means that instead of a closed data type, you need to define a pattern functor in which recursion is described using the final argument, and which should instantiate the Generic1 type class (this can be done automatically if you are using GHC).

For example, the following block of code defines a Tree' data type in the required fashion:

{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics

data Tree' f = Leaf'
             | Branch' { elt :: Int, left :: f, right :: f }
             deriving (Generic1, Show)

Notice the f argument in the place where Tree' would usually be found. In addition, we have declared the constructors using ' at the end, but we will get rid of them soon.

Now, if you want to create terms, you need to close the type, which essentially makes the f argument refer back to Tree'. You do so by using Fix:

type Tree = Fix Tree'

However, this induces the need to add explicit Fix constructors at each level. To alleviate this problem, let's define pattern synonyms, available from GHC 7.8 on:

pattern Leaf = Fix Leaf'
pattern Branch n l r = Fix (Branch' n l r)

From an outsider point of view, now your data type is a normal one, with Leaf and Branch as its constructors:

aTree = Branch 2 (Branch 2 Leaf Leaf) Leaf

Tree regular expressions

Tree regular expressions are parametrized by a pattern functor: in this way they are flexible enough to be used with different data types, while keeping our loved Haskell type safety.

The available combinators to build regular expressions follow the syntax of Tree Automata Techniques and Applications, Chapter 2.

Emptiness

The expressions empty_ and none do not match any value. They can be used to signal an error branch on an expressions.

Whole language

You can match any possible term using any_. It is commonly use in combination with capture to extract information from a term.

Choice

A regular expression of the form r1 <||> r2 tries to match r1, and if it not possible, it tries to do so with r2. Note than when capturing, the first regular expression is given priority.

Injection

Of course, at some point you would like to check whether some term has a specific shape. In our case, this means that it has been constructed in some specific way. In order to do so, you need to inject the constructor as a tree regular expression. When doing so, you can use the same syntax as usual, but where at the recursive positions you write regular expressions again.

Let's make it clearer with an example. In our initial definition we had a constructor Branch' with type:

Branch' :: Int -> f -> f -> Tree' f

Once you inject it using inj, the resulting constructor becomes:

inj . Branch' :: Int -> Regex' c f -> Regex c' f -> Tree' (Regex' c f)

Notice that fields with no f do not change their type. Now, here is how you would represent a tree whose top node is a 2:

topTwo = inj (Branch' 2 any_ any_)

In some cases, you don't want to check the value of a certain position which is not recursive. In that case, you cannot use any_, since we are not talking about the type being built upon. For that case, you may use the special value __.

For example, here is how you would represent the shape of a tree which has at least one branch point:

someBranching = inj (Branch' __ any_ any_)

Iteration and concatenation

Iteration in tree regular expressions is not as easy as in word languages. The reason is that iteration may occur several times, and in different positions. For that reason, we need to introduce the notion of hole: a hole is a placeholder where iteration takes place.

In t-regex hole names are represented as lambda-bound variables. Then, you can use any of the functions square, var or # to indicate a position where the placeholder should be put. Iteration is then indicated by a call to iter or its post-fix version ^*.

The following two are ways to indicate a Tree where all internal nodes include the number 2:

{-# LANGUAGE PostfixOperators #-}

regex1 = Regex $
           iter $ \k ->
                  inj (Branch' 2 (square k) (square k))
             <||> inj Leaf'

regex2 = Regex $ ( (\k -> inj (Branch' 2 (k#) (k#)) <||> inj Leaf')^* )

Notice that the use of PostfixOperators enables a much terse language.

Iteration is an instance of a more general construction called concatenation, where a hole in an expression is filled by another given expression. The general shape of those are:

(\k -> ... (k#) ...) <.> r  -- k is replaced by r in the first expression

Matching and capturing

You can check whether a term t is matched by a regular expression r by simply using:

r `matches` t

However, the full power of tree regular expression come at the moment you start capturing subterms of your tree. A capture group is introduced in a expression by either capture or <<-, and all of them appearing in a expression should be from the same type. For example, we can refine the previous regex2 to capture leaves:

regex2 = Regex $ ( (\k -> inj (Branch' 2 (k#) (k#)) <||> "leaves" <<- inj Leaf')^* )

To check whether a term matches a expression and capture the subterms, you need to call match. The result is of type Maybe (Map c (m (Fix f))). Let's see what it means:

The outermost Maybe indicates whether the match is successful or not. A value of Nothing indicates that the given term does not match the regular expression,
Capture groups are returned as a Map, whose keys are those given at capture or <<-. For that reason, you need capture identifiers to be instances of Ord,
Finally, you have a choice about how to save the found subterms, given by the Alternative m. Most of the time, you will make m = [], which means that all matches of the group will be returned as a list. Other option is using m = Maybe, where only the first match is returned.

Tree regular expression patterns

Capturing is quite simple, but comes with a problem: it becomes easy to mistake the name of a capture group, so your code becomes quite brittle. For that reason, t-regex includes matchers which you can use at the same positions where pattern matching is usually done, and which take care of linking capture groups to variable names, making it impossible to mistake them.

To use them, you need to import Data.Regex.TH. Then, a quasi-quoter named rx is available to you. Here is an example:

{-# LANGUAGE QuasiQuotes #-}

example :: Tree -> [Tree]
example [rx| (\k -> inj (Branch' 2 (k#) (k#)) <||> leaves <<- inj Leaf')^* |]
           = leaves
example _  = []

The name of the capture group, leaves, is now available in the body of the example function. There is no need to look inside maps, this is all taken care for you.

Note that when using the rx quasi-quoter, you have no choice about the Alternative m to use when matching. Instead, you always get as value of each capture group a list of terms.

For those who don't like using quasi-quotation, t-regex provides a less magical version called with. In this case, you need to introduce the variables in a explicit manner, and then pattern match on a tuple wrapped inside a Maybe. The previous example would be written as:

{-# LANGUAGE ViewPatterns #-}

example :: Tree -> [Tree]
example with (\leaves -> Regex $ iter $ \k -> inj (Branch' 2 (k#) (k#))
                                              <||> leaves <<- inj Leaf' )
             -> Just leaves
           = leaves
example _  = []

Notice that the pattern is always similar with (\v1 v2 ... -> regular expression) -> Just (v1,v2,...).

Random generation

You can use t-regex to generate random values of a type which satisfy a certain tree regular expression. Of course, you might always generate random values and then check that they match the given expression, but this is usually very costly and maybe even statistically impossible. Instead, you should use arbitraryFromRegex.

instance Arbitrary Tree where
  arbitrary = frequency
    [ (1, return Leaf)
    , (5, Branch <$> arbitrary <*> arbitrary <*> arbitrary) ]

> arbitraryFromRegex regex2

Note that in the previous example we also gave an instance declaration for Arbitrary. This class comes from the QuickCheck package, and is needed to generate unconstrained random values for the case in which any_ is found.

Sometimes you may not be able or want to write such an instance. In that case, you can use arbitraryFromRegexAndGen, which takes an additional argument from which any_ values are generated.

Attribute grammars

Attribute grammars are a powerful way to perform computations over a term. The main idea is that each node in your term (when seen as a tree) is traversed, and two sets of information are recorded at each point:

Inherited attributes go from parent to children. When describing a grammar, each node needs to specify the value of inherited attributes of all their children,
Synthesized attributes flow in the other direction. At each node, you need to described how to get the value of each synthesized attribute based on your inherited attributes and the synthesized attributes of children.

The most performant attribute grammar compilers, such as UUAGC only allow deciding which rule to apply on a node depending on their topmost constructor. With t-regex you can look as deep as you want to take this decision (but of course, performance will suffer if you do this very often).

Here is an example of a grammar which computes a graphical representation of a Tree plus its number of leaves:

grammar = [
    rule $ \l r ->
     inj (Branch' 2 (l <<- any_) (r <<- any_)) ->> do
       (lText,lN) <- use (at l . syn)
       (rText,rN) <- use (at r . syn)
       this.syn._1 .= "(" ++ lText ++ ")-SPECIAL-(" ++ rText ++ ")"
       this.syn._2 .= lN + rN
  , rule $ \l r ->
     inj (Branch' __ (l <<- any_) (r <<- any_)) ->>> \(Branch e _ _) -> do
	   check $ e >= 0
       (lText,lN) <- use (at l . syn)
       (rText,rN) <- use (at r . syn)
       this.syn._1 .= "(" ++ lText ++ ")-" ++ show e ++ "-(" ++ rText ++ ")"
       this.syn._2 .= lN + rN
  , rule $ inj Leaf' ->> do
       this.syn._1 .= "leaf"
       this.syn._2 .= Sum 1
  ]

Let's dissect it part by part.

First of all, a grammar is made of a series of rules. Each rule follows the same schema:

rule $ \v1 ... vn ->
  regular expression ->> do
    actions

rule is the constant part which prefixes every rule. Then, you have the set of variables which will be used to capture information from the term, in a similar way to previous section. After that you have the tree regular expression the term needs to match. Finally, and separated by ->>, you find the actions to perform when this rule is selected.

Two small extensions are shown in the second rule. By default, ->> does not give you access to the matched term. However, if you need to access some of its information (for example, because you used __, as in this case), you can use the alternative version ->>> which gives this as an argument. The second extension is the use of check to pinpoint a logical condition which is not captured by the regular expression itself.

The syntax for the actions relies heavily in lenses and operators from the lens package. In particular, you have four lenses:

this.inh gives access to the inherited attributes of the node,
at n . syn gives access to the synthesized attributes of children,
this.syn is where you set the synthesized attributes of your node,
at n . inh is where you set the inherited attributes of children,

Furthermore, you can combine those with lenses over your inherited and synthesized attribute data type to have more lightweight syntax. In our case, the synthesized attributes are a tuple (String, Sum Int), so we access them with _1 and _2, as shown in the example.

The general rule is that you read values using use, and set values via .=. If some value is not set, it defaults to the empty element of your synthesized type, or to the value of parent node in the case of inherited attributes.