Awesome

RegexParser

RegexParser is a regular expression engine written in C# that:

is fully featured (character escapes, character classes, greedy/lazy quantifiers, alternations, anchors)
uses backtracking while matching on the target string
follows functional programming principles in its implementation (parser combinators, functional data structures, side-effect free code)
is well tested and performance-optimized

<h2>Contents</h2> <toc>

Structure
Phase 1: Parsing the Regex Pattern
Phase 2: Transforming the Abstract Syntax Tree
Phase 3: Pattern Matching on the Target String
- Matching without Backtracking
- Matching with Backtracking
Testing
Performance
Implemented Regex Features
- Missing Features

</toc>

Structure

RegexParser works in three phases:

Parsing the regex pattern, which results in an Abstract Syntax Tree (AST)
Transforming the AST
Pattern matching on the target string using the AST

Phases 1 and 2 happen only once for a given regex. Phase 3 may happen multiple times, for different target strings.

Phase 1: Parsing the Regex Pattern

The Parser Type

The Parser type is defined as:

public delegate Result<TToken, TTree> Parser<TToken, TTree>(IConsList<TToken> consList);

public class Result<TToken, TTree>
{
    public Result(TTree tree, IConsList<TToken> rest)
    {
        Tree = tree;
        Rest = rest;
    }
    public TTree Tree { get; private set; }
    public IConsList<TToken> Rest { get; private set; }
}

public interface IConsList<T>
{
    T Head { get; }
    IConsList<T> Tail { get; }

    bool IsEmpty { get; }
    int Length { get; }
}

A parser is simply a function that takes a list of tokens (e.g., characters), and returns a syntax tree and the list of unconsumed tokens. To indicate failure to match, it will return null.

The Parser type is equivalent to the following Haskell type:

newtype Parser token tree = Parser ([token] -> Maybe (tree, [token]))

NOTE: The idea (and syntax) of parser combinators came from these articles:

Monadic Parsing in Haskell (Hutton, Meijer) (1998)

Parsec, a fast combinator parser (Leijen) (2001)

In the articles, the type is defined similarly to:

newtype Parser token tree = Parser ([token] -> [(tree, [token])])

This allows the parser to be ambiguous (to parse a string in multiple ways). The parser will return either a list of one or more "success" alternatives, or an empty list to indicate failure.

As the regex syntax is non-ambigious, the Maybe definition was preferred.

Parser Combinators in C#

Parser combinators are higher-order functions that can be used to combine basic parsers to construct parsers for more complex rules.

The following parser combinators are defined (see source for descriptions):

Choice
Option
Many
Many1
Count
Sequence
PrefixedBy
Between
SepBy
SepBy1
NotFollowedBy
Eof
AnyToken
Succeed
Fail

For example, the Choice combinator is defined as:

// Try to apply the parsers in the 'choices' list in order, until one succeeds.
// Return the tree returned by the succeeding parser.
public static Parser<TToken, TTree> Choice<TTree>(params Parser<TToken, TTree>[] choices)
{
    return consList =>
    {
        foreach (var parser in choices)
        {
            var result = parser(consList);
            if (result != null)
                return result;
        }
        return null;
    };
}

Beside combinators, there are also a number of "primitive" character parsers (see source for descriptions):

AnyChar
Satisfy
Char
Digit
OctDigit
HexDigit
OneOf
NoneOf

Each of these will consume exactly one character.

The Parser Monad in C#

LINQ, the data querying subset of C#, offers a form of syntactic sugar that allows writing code similar to Haskell do notation. This greatly simplifies the writing of more complex parsers.

For example, let's say we want to write a parser called naturalNum, which reads a sequence of digits and returns an int as syntactic tree. Using parser combinators and primitives from the previous section (i.e., Many1 and Digit), we can define it like this:

Parser<char, int> naturalNum =
    consList =>
    {
        var result = Many1(Digit)(consList);

        if (result != null)
            return new Result<char, int>(readInt(result.Tree), result.Rest);
        else
            return null;
    };

Because this is such a common pattern, we can define a helper extension method, Select():

public static Parser<TToken, TTree2> Select<TToken, TTree, TTree2>(
                                        this Parser<TToken, TTree> parser,
                                        Func<TTree, TTree2> selector)
{
    return consList =>
    {
        var result = parser(consList);

        if (result != null)
            return new Result<TToken, TTree2>(selector(result.Tree), result.Rest);
        else
            return null;
    };
}

Now the parser can be written more simply as:

Parser<char, int> naturalNum = Many1(Digit).Select(ds => readInt(ds));

A Select() method with a signature similar to ours has special meaning for LINQ. Taking advantage of that, we can rewrite the parser in a syntactic sugar form, which will be translated (desugared) by the C# preprocessor to exactly the same code as above:

Parser<char, int> naturalNum = from ds in Many1(Digit)
                               select readInt(ds);

Notice the similarity with do notation in Haskell:

naturalNum = do ds <- many1 digit
                return (readInt ds)

So far, we could only use the from keyword once per expression. By also defining a LINQ-related method called SelectMany() (see source), we become able to build parser expressions that use from more than once. For example, if we want to parse an integer (prefixed by an optional minus sign), we can write:

Parser<char, int> integerNum = from sign in Option('+', Char('-'))
                               from ds in Many1(Digit)
                               let s = (sign == '-' ? -1 : 1)
                               select s * readInt(ds);

This is equivalent to the following Haskell parser:

integerNum = do sign <- option '+' (char '-')
                ds <- many1 digit
                let s = if sign == '-' then -1 else 1
                return (s * (readInt ds))

Parsing the Regex Language

Using parser combinators and primitives, as well as the syntactic sugar notation described, we can write a parser for the whole regex language (as supported by RegexParser) in less than 150 lines of code (see source).

For example, the Quantifier parser, which accepts suffixes <code>*</code>, <code>+</code>, <code>?</code>, <code>{n}</code>, <code>{n,}</code>, <code>{n,m}</code> (greedy quantifiers), and <code>*?</code>, <code>+?</code>, <code>??</code>, <code>{n}?</code>, <code>{n,}?</code>, <code>{n,m}?</code> (lazy quantifiers), is defined like this:

var RangeQuantifierSuffix = Between(Char('{'),
                                    Char('}'),

                                    from min in NaturalNum
                                    from max in
                                        Option(min, PrefixedBy(Char(','),
                                                               Option(null, Nullable(NaturalNum))))
                                    select new { Min = min, Max = max });

Quantifier = from child in Atom
             from quant in
                 Choice(
                     from _q in Char('*') select new { Min = 0, Max = (int?)null },
                     from _q in Char('+') select new { Min = 1, Max = (int?)null },
                     from _q in Char('?') select new { Min = 0, Max = (int?)1 },
                     RangeQuantifierSuffix)
             from greedy in
                 Option(true, from _c in Char('?')
                              select false)
             select (BasePattern)new QuantifierPattern(child, quant.Min, quant.Max, greedy);

The more complex parsers are built from more simple ones. The topmost parser is called Regex. The result of parsing is a tree of pattern objects (derived from class BasePattern). Here are the main pattern classes (see sources):

CharEscapePattern
CharGroupPattern, CharRangePattern, CharClassSubtractPattern, AnyCharPattern, CaseInsensitiveCharPattern (dealing with character classes)
GroupPattern
QuantifierPattern
AlternationPattern
AnchorPattern

All pattern classes are immutable.

Phase 2: Transforming the Abstract Syntax Tree

The following transforms are performed (see sources):

BaseASTTransform: Remove empty groups; replace non-capturing groups that have a single child pattern with the pattern itself.
QuantifierASTTransform: Split quantifiers into their deterministic and non-deterministic parts. For example, the QuantifierPattern representing <code>a{2,5}</code> will be split into two patterns, equivalent to <code>a{2}a{0,3}</code>. The second pattern is fully non-deterministic, which means that backtracking can and will be used at every step.

Also, clean up the corner cases: quantifiers with empty child patterns, etc.
RegexOptionsASTTransform: Implement the global regex options IgnoreCase, Multiline and Singleline by transforming CharPattern and AnchorPattern objects.

Phase 3: Pattern Matching on the Target String

Matching without Backtracking

The simplest way to parse the target string is to build a parser from the AST using combinators we already have. This, however, would be a non-backtracking parser, as our Parser type does not allow returning multiple "success" alternatives.

Here is a recursive definition (see source), based on the Sequence, Count, Choice and Satisfy combinators:

private Parser<char, string> createParser(BasePattern pattern)
{
    if (pattern == null)
        throw new ArgumentNullException("pattern.", "Pattern is null when creating match parser.");

    switch (pattern.Type)
    {
        case PatternType.Group:
            return from vs in
                       CharParsers.Sequence(((GroupPattern)pattern).Patterns
                                                                   .Select(p => createParser(p)))
                   select vs.JoinStrings();

        case PatternType.Quantifier:
            QuantifierPattern quant = (QuantifierPattern)pattern;
            return from vs in CharParsers.Count(quant.MinOccurrences,
                                                quant.MaxOccurrences,
                                                createParser(quant.ChildPattern))
                   select vs.JoinStrings();

        case PatternType.Alternation:
            return CharParsers.Choice(((AlternationPattern)pattern).Alternatives
                                                                   .Select(p => createParser(p))
                                                                   .ToArray());

        case PatternType.Char:
            return from c in CharParsers.Satisfy(((CharPattern)pattern).IsMatch)
                   select new string(c, 1);

        default:
            throw new ApplicationException(
                string.Format("ExplicitDFAMatcher: unrecognized pattern type ({0}).",
                              pattern.GetType().Name));
    }
}

As a further drawback, this parser does not (and cannot) deal with anchor patterns.

Matching with Backtracking

To understand the need for backtracking, let's consider a simple example: we want to find all the words that end with <code>t</code> within the target <code>"a lot of important text"</code>.

We start with the most logical pattern: <code>\w+t</code> (any word character, repeated one or more times, followed by <code>t</code>). This doesn't work as expected: <code>\w+</code> will match <code>lot</code> (including the final <code>t</code>), so the <code>t</code> in the pattern won't have anything left to match.

We then try the pattern <code>[\w-[t]]+t</code> (any word character except <code>t</code>, repeated one or more times, followed by <code>t</code>). This will match <code>lot</code> correctly, but then it will match <code>import</code>, <code>ant</code>, and <code>ext</code>. Not correct.

Next we try <code>\w+?t</code> (any word character, repeated one or more times lazily, followed by <code>t</code>). This produces the same result as above. (Not to mention the fact that lazy quantifiers actually need backtracking in order to work.)

What would work is the following: suppose that <code>\w+</code> matches every word character including the <code>t</code> (in <code>lot</code>); then, when the parser notices that the next part of the pattern (i.e., <code>t</code>) does not match, it tries to backtrack one match of the quantifier's subpattern (one word character); so now it has matched only <code>lo</code>, and the <code>t</code> in the pattern will match.

Possible complications:

Backtracking might need to go back thousands of matches (all of which need to be kept track of in a stack, like a trail of breadcrumbs).
The moment of "mismatch" may arrive long after the end of the non-deterministic pattern (instead of right after it, as in our example), thousands of characters away, and in a different part of the pattern tree. Jumping back will need to restore the whole context as of just before the match, including location in the pattern tree, and location in the target string.

See the implementation here.

Testing

The main unit tests are matching tests for the supported regex patterns: char escapes, char classes, quantifiers (greedy/lazy), alternations, anchors, and global regex options (see sources).

The tests make use of RegexAssert.AreMatchesSameAsMsoft() (see source) as the main way of asserting correctness. This method compares the result returned by RegexParser (the whole collection of matches, including position in the target string) with the result of the standard .NET implementation (System.Text.RegularExpressions.Regex.Matches()).

There are also unit tests for AST transforms and pattern classes.

Performance

Performance is about one order of magnitude slower than the standard .NET implementation, due in no small part to running on a virtual machine, the .NET Runtime (whereas the Microsoft version is closer to the machine, being written in C).

However, care has been taken to ensure that the performance profile is similar to the standard .NET implementation. So, if the .NET implementation runs in O(n) time, the RegexParser version will also run in O(n) time.

There are a number of "informal" performance tests, run from the console, with performance figures attached to code as comments. For example (see source):

const int times = 1, inputLength = 1000000;    // or 10000000

string alphabet = "abcdefghijklmnopqrstuvwxyz";
string lowercaseChars = EnumerablePerformanceTests.RepeatChars(alphabet, inputLength)
                                                  .AsString();

testRegexMatches(lowercaseChars, @"(\w{1000})+", times);
//  0.857 sec. (inputLength =  1,000,000)
//  9.044 sec. (inputLength = 10,000,000)

See more performance tests here.

Implemented Regex Features

RegexParser is a fairly complete regex engine. The following constructs have been implemented:

Character escapes:
- any character except for one of <code>.$^{[(|)*+?\</code> matches itself
- \n: new line
- \r: carriage return
- \t: tab
- \b: backspace (only inside a character class)
- <code><strong>\</strong>nnn</code>: ASCII character, where nnn is a two- or three-digit octal character code
- <code><strong>\x</strong>nn</code>: ASCII character, where nn is a two-digit hexadecimal character code
- <code><strong>\u</strong>nnnn</code>: UTF-16 code unit whose value is nnnn hexadecimal
- \ followed by a character not recognized as escaped matches that character
Character classes:
- <code>.</code> matches any character except <code>\n</code>
- positive character groups (e.g., <code>[aeiou]</code>, <code>[a-zA-Z]</code>, <code>[abcA-H\d\n]</code>)
- negative character groups (e.g., <code>[^a-zA-Z]</code>)
- named character classes:
  - \w: a word character; same as <code>[0-9A-Z_a-z]</code>
  - \W: a non-word character; same as <code>[^0-9A-Z_a-z]</code>
  - \s: a whitespace character; same as <code>[ \n\r\t]</code>
  - \S: a non-whitespace character; same as <code>[^ \n\r\t]</code>
  - \d: a digit character; same as <code>[0-9]</code>
  - \D: a non-digit character; same as <code>[^0-9]</code>
- character class subtraction (e.g., <code>[0-9-[246]]</code> matches any digit except for 2, 4 and 6)
Grouping (without capturing): <code>(subexpr)</code>
Quantifiers:
- Greedy: <code>*</code>, <code>+</code>, <code>?</code>, <code>{n}</code>, <code>{n,}</code>, <code>{n,m}</code>
- Lazy: <code>*?</code>, <code>+?</code>, <code>??</code>, <code>{n}?</code>, <code>{n,}?</code>, <code>{n,m}?</code>
<blockquote>The difference between _greedy_ and _lazy_ quantifiers is in how they control backtracking. _Greedy_ quantifiers will first try to match _as many_ characters as possible. Then, if the rest of the regex does not match, they will backtrack to matching one character _less_, then try again on the rest of the regex--and so on, one character _less_ every time. _Lazy_ quantifiers, on the other hand, will first try to match _as few_ characters as possible, then backtrack to matching one character _more_ every time.</blockquote>
Alternation: |
Anchors:
- <code>^</code>: start of string or line (depending on the Multiline option)
- <code>$</code>: end of string or line (depending on the Multiline option)
- \A: start of string only
- \Z: end of string or before ending newline
- \z: end of string only
- \G: contiguous match (must start where previous match ended)
- \b: word boundary
- \B: non-word boundary
Regex options (global):
- IgnoreCase
- Multiline: <code>^</code> and <code>$</code> will match the beginning and end of each line (instead of the beginning and end of the input string)
- Singleline: the period (<code>.</code>) will match every character (instead of every character except \n)

Missing Features

Still on the TODO list:

Group related:
- capturing groups
- backreferences
- named groups
- atomic groups (non-backtracking)
Substitution:
- Regex.Replace() method
- substitution patterns: <code>$$</code>, <code>$1</code>, etc.
Look-ahead
Look-behind
Regex options:
- IgnorePatternWhitespace
- ExplicitCapture
- inline options (as opposed to global)
Comments in patterns: <code>(?#comment)</code>