Home

Awesome

RegexParser

RegexParser is a regular expression engine written in C# that:

<h2>Contents</h2> <toc> </toc>

Structure

RegexParser works in three phases:

  1. Parsing the regex pattern, which results in an Abstract Syntax Tree (AST)
  2. Transforming the AST
  3. Pattern matching on the target string using the AST

Phases 1 and 2 happen only once for a given regex. Phase 3 may happen multiple times, for different target strings.

Phase 1: Parsing the Regex Pattern

The Parser Type

The Parser type is defined as:

public delegate Result<TToken, TTree> Parser<TToken, TTree>(IConsList<TToken> consList);

public class Result<TToken, TTree>
{
    public Result(TTree tree, IConsList<TToken> rest)
    {
        Tree = tree;
        Rest = rest;
    }
    public TTree Tree { get; private set; }
    public IConsList<TToken> Rest { get; private set; }
}

public interface IConsList<T>
{
    T Head { get; }
    IConsList<T> Tail { get; }

    bool IsEmpty { get; }
    int Length { get; }
}

A parser is simply a function that takes a list of tokens (e.g., characters), and returns a syntax tree and the list of unconsumed tokens. To indicate failure to match, it will return null.

The Parser type is equivalent to the following Haskell type:

newtype Parser token tree = Parser ([token] -> Maybe (tree, [token]))

NOTE: The idea (and syntax) of parser combinators came from these articles:

In the articles, the type is defined similarly to:

newtype Parser token tree = Parser ([token] -> [(tree, [token])])

This allows the parser to be ambiguous (to parse a string in multiple ways). The parser will return either a list of one or more "success" alternatives, or an empty list to indicate failure.

As the regex syntax is non-ambigious, the Maybe definition was preferred.

Parser Combinators in C#

Parser combinators are higher-order functions that can be used to combine basic parsers to construct parsers for more complex rules.

The following parser combinators are defined (see source for descriptions):

For example, the Choice combinator is defined as:

// Try to apply the parsers in the 'choices' list in order, until one succeeds.
// Return the tree returned by the succeeding parser.
public static Parser<TToken, TTree> Choice<TTree>(params Parser<TToken, TTree>[] choices)
{
    return consList =>
    {
        foreach (var parser in choices)
        {
            var result = parser(consList);
            if (result != null)
                return result;
        }
        return null;
    };
}

Beside combinators, there are also a number of "primitive" character parsers (see source for descriptions):

Each of these will consume exactly one character.

The Parser Monad in C#

LINQ, the data querying subset of C#, offers a form of syntactic sugar that allows writing code similar to Haskell do notation. This greatly simplifies the writing of more complex parsers.

For example, let's say we want to write a parser called naturalNum, which reads a sequence of digits and returns an int as syntactic tree. Using parser combinators and primitives from the previous section (i.e., Many1 and Digit), we can define it like this:

Parser<char, int> naturalNum =
    consList =>
    {
        var result = Many1(Digit)(consList);

        if (result != null)
            return new Result<char, int>(readInt(result.Tree), result.Rest);
        else
            return null;
    };

Because this is such a common pattern, we can define a helper extension method, Select():

public static Parser<TToken, TTree2> Select<TToken, TTree, TTree2>(
                                        this Parser<TToken, TTree> parser,
                                        Func<TTree, TTree2> selector)
{
    return consList =>
    {
        var result = parser(consList);

        if (result != null)
            return new Result<TToken, TTree2>(selector(result.Tree), result.Rest);
        else
            return null;
    };
}

Now the parser can be written more simply as:

Parser<char, int> naturalNum = Many1(Digit).Select(ds => readInt(ds));

A Select() method with a signature similar to ours has special meaning for LINQ. Taking advantage of that, we can rewrite the parser in a syntactic sugar form, which will be translated (desugared) by the C# preprocessor to exactly the same code as above:

Parser<char, int> naturalNum = from ds in Many1(Digit)
                               select readInt(ds);

Notice the similarity with do notation in Haskell:

naturalNum = do ds <- many1 digit
                return (readInt ds)

So far, we could only use the from keyword once per expression. By also defining a LINQ-related method called SelectMany() (see source), we become able to build parser expressions that use from more than once. For example, if we want to parse an integer (prefixed by an optional minus sign), we can write:

Parser<char, int> integerNum = from sign in Option('+', Char('-'))
                               from ds in Many1(Digit)
                               let s = (sign == '-' ? -1 : 1)
                               select s * readInt(ds);

This is equivalent to the following Haskell parser:

integerNum = do sign <- option '+' (char '-')
                ds <- many1 digit
                let s = if sign == '-' then -1 else 1
                return (s * (readInt ds))

Parsing the Regex Language

Using parser combinators and primitives, as well as the syntactic sugar notation described, we can write a parser for the whole regex language (as supported by RegexParser) in less than 150 lines of code (see source).

For example, the Quantifier parser, which accepts suffixes <code>*</code>, <code>+</code>, <code>?</code>, <code>{n}</code>, <code>{n,}</code>, <code>{n,m}</code> (greedy quantifiers), and <code>*?</code>, <code>+?</code>, <code>??</code>, <code>{n}?</code>, <code>{n,}?</code>, <code>{n,m}?</code> (lazy quantifiers), is defined like this:

var RangeQuantifierSuffix = Between(Char('{'),
                                    Char('}'),

                                    from min in NaturalNum
                                    from max in
                                        Option(min, PrefixedBy(Char(','),
                                                               Option(null, Nullable(NaturalNum))))
                                    select new { Min = min, Max = max });

Quantifier = from child in Atom
             from quant in
                 Choice(
                     from _q in Char('*') select new { Min = 0, Max = (int?)null },
                     from _q in Char('+') select new { Min = 1, Max = (int?)null },
                     from _q in Char('?') select new { Min = 0, Max = (int?)1 },
                     RangeQuantifierSuffix)
             from greedy in
                 Option(true, from _c in Char('?')
                              select false)
             select (BasePattern)new QuantifierPattern(child, quant.Min, quant.Max, greedy);

The more complex parsers are built from more simple ones. The topmost parser is called Regex. The result of parsing is a tree of pattern objects (derived from class BasePattern). Here are the main pattern classes (see sources):

All pattern classes are immutable.

Phase 2: Transforming the Abstract Syntax Tree

The following transforms are performed (see sources):

Phase 3: Pattern Matching on the Target String

Matching without Backtracking

The simplest way to parse the target string is to build a parser from the AST using combinators we already have. This, however, would be a non-backtracking parser, as our Parser type does not allow returning multiple "success" alternatives.

Here is a recursive definition (see source), based on the Sequence, Count, Choice and Satisfy combinators:

private Parser<char, string> createParser(BasePattern pattern)
{
    if (pattern == null)
        throw new ArgumentNullException("pattern.", "Pattern is null when creating match parser.");

    switch (pattern.Type)
    {
        case PatternType.Group:
            return from vs in
                       CharParsers.Sequence(((GroupPattern)pattern).Patterns
                                                                   .Select(p => createParser(p)))
                   select vs.JoinStrings();

        case PatternType.Quantifier:
            QuantifierPattern quant = (QuantifierPattern)pattern;
            return from vs in CharParsers.Count(quant.MinOccurrences,
                                                quant.MaxOccurrences,
                                                createParser(quant.ChildPattern))
                   select vs.JoinStrings();

        case PatternType.Alternation:
            return CharParsers.Choice(((AlternationPattern)pattern).Alternatives
                                                                   .Select(p => createParser(p))
                                                                   .ToArray());

        case PatternType.Char:
            return from c in CharParsers.Satisfy(((CharPattern)pattern).IsMatch)
                   select new string(c, 1);

        default:
            throw new ApplicationException(
                string.Format("ExplicitDFAMatcher: unrecognized pattern type ({0}).",
                              pattern.GetType().Name));
    }
}

As a further drawback, this parser does not (and cannot) deal with anchor patterns.

Matching with Backtracking

To understand the need for backtracking, let's consider a simple example: we want to find all the words that end with <code>t</code> within the target <code>"a lot of important text"</code>.

We start with the most logical pattern: <code>\w+t</code> (any word character, repeated one or more times, followed by <code>t</code>). This doesn't work as expected: <code>\w+</code> will match <code>lot</code> (including the final <code>t</code>), so the <code>t</code> in the pattern won't have anything left to match.

We then try the pattern <code>[\w-[t]]+t</code> (any word character except <code>t</code>, repeated one or more times, followed by <code>t</code>). This will match <code>lot</code> correctly, but then it will match <code>import</code>, <code>ant</code>, and <code>ext</code>. Not correct.

Next we try <code>\w+?t</code> (any word character, repeated one or more times lazily, followed by <code>t</code>). This produces the same result as above. (Not to mention the fact that lazy quantifiers actually need backtracking in order to work.)

What would work is the following: suppose that <code>\w+</code> matches every word character including the <code>t</code> (in <code>lot</code>); then, when the parser notices that the next part of the pattern (i.e., <code>t</code>) does not match, it tries to backtrack one match of the quantifier's subpattern (one word character); so now it has matched only <code>lo</code>, and the <code>t</code> in the pattern will match.

Possible complications:

See the implementation here.

Testing

The main unit tests are matching tests for the supported regex patterns: char escapes, char classes, quantifiers (greedy/lazy), alternations, anchors, and global regex options (see sources).

The tests make use of RegexAssert.AreMatchesSameAsMsoft() (see source) as the main way of asserting correctness. This method compares the result returned by RegexParser (the whole collection of matches, including position in the target string) with the result of the standard .NET implementation (System.Text.RegularExpressions.Regex.Matches()).

There are also unit tests for AST transforms and pattern classes.

Performance

Performance is about one order of magnitude slower than the standard .NET implementation, due in no small part to running on a virtual machine, the .NET Runtime (whereas the Microsoft version is closer to the machine, being written in C).

However, care has been taken to ensure that the performance profile is similar to the standard .NET implementation. So, if the .NET implementation runs in O(n) time, the RegexParser version will also run in O(n) time.

There are a number of "informal" performance tests, run from the console, with performance figures attached to code as comments. For example (see source):

const int times = 1, inputLength = 1000000;    // or 10000000

string alphabet = "abcdefghijklmnopqrstuvwxyz";
string lowercaseChars = EnumerablePerformanceTests.RepeatChars(alphabet, inputLength)
                                                  .AsString();

testRegexMatches(lowercaseChars, @"(\w{1000})+", times);
//  0.857 sec. (inputLength =  1,000,000)
//  9.044 sec. (inputLength = 10,000,000)

See more performance tests here.

Implemented Regex Features

RegexParser is a fairly complete regex engine. The following constructs have been implemented:

Missing Features

Still on the TODO list: