RegexParser
RegexParser is a regular expression engine written in C# that:
- is fully featured (character escapes, character classes, greedy/lazy quantifiers, alternations, anchors)
- uses backtracking while matching on the target string
- follows functional programming principles in its implementation (parser combinators, functional data structures, side-effect free code)
- is well tested and performance-optimized
- Structure
- Phase 1: Parsing the Regex Pattern
- Phase 2: Transforming the Abstract Syntax Tree
- Phase 3: Pattern Matching on the Target String
- Testing
- Performance
- Implemented Regex Features
Structure
RegexParser works in three phases:
- Parsing the regex pattern, which results in an Abstract Syntax Tree (AST)
- Transforming the AST
- Pattern matching on the target string using the AST
Phases 1 and 2 happen only once for a given regex. Phase 3 may happen multiple times, for different target strings.
Phase 1: Parsing the Regex Pattern
The Parser Type
The Parser type is defined as:
public delegate Result<TToken, TTree> Parser<TToken, TTree>(IConsList<TToken> consList);
public class Result<TToken, TTree>
{
public Result(TTree tree, IConsList<TToken> rest)
{
Tree = tree;
Rest = rest;
}
public TTree Tree { get; private set; }
public IConsList<TToken> Rest { get; private set; }
}
public interface IConsList<T>
{
T Head { get; }
IConsList<T> Tail { get; }
bool IsEmpty { get; }
int Length { get; }
}
A parser is simply a function that takes a list of tokens (e.g., characters) and returns a syntax tree together with the list of unconsumed tokens. To indicate failure to match, it returns null.
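For illustration only (this exact parser does not appear in the source), a parser that matches the single character 'a' can be written directly against this delegate:
// Illustration: a parser that consumes a single 'a'.
// On success it returns the matched character plus the remaining input; otherwise null.
Parser<char, char> parseA =
    consList =>
        !consList.IsEmpty && consList.Head == 'a'
            ? new Result<char, char>(consList.Head, consList.Tail)
            : null;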
The Parser type is equivalent to the following Haskell type:
newtype Parser token tree = Parser ([token] -> Maybe (tree, [token]))
NOTE: The idea (and syntax) of parser combinators came from these articles:
- Monadic Parsing in Haskell (Hutton, Meijer) (1998)
- Parsec, a fast combinator parser (Leijen) (2001)
In the articles, the type is defined similarly to:
newtype Parser token tree = Parser ([token] -> [(tree, [token])])
This allows the parser to be ambiguous (to parse a string in multiple ways). The parser will return either a list of one or more "success" alternatives, or an empty list to indicate failure.
As the regex syntax is non-ambiguous, the Maybe definition was preferred.
Parser Combinators in C#
Parser combinators are higher-order functions that can be used to combine basic parsers to construct parsers for more complex rules.
The following parser combinators are defined (see source for descriptions):
Choice, Option, Many, Many1, Count, Sequence, PrefixedBy, Between, SepBy, SepBy1, NotFollowedBy, Eof, AnyToken, Succeed, Fail
For example, the Choice combinator is defined as:
// Try to apply the parsers in the 'choices' list in order, until one succeeds.
// Return the tree returned by the succeeding parser.
public static Parser<TToken, TTree> Choice<TTree>(params Parser<TToken, TTree>[] choices)
{
return consList =>
{
foreach (var parser in choices)
{
var result = parser(consList);
if (result != null)
return result;
}
return null;
};
}
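As a small usage example (not taken from the source), a parser that accepts either sign character can be built with the Char primitive introduced below:
Parser<char, char> sign = Choice(Char('+'), Char('-'));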
Besides combinators, there are also a number of "primitive" character parsers (see source for descriptions):
AnyChar, Satisfy, Char, Digit, OctDigit, HexDigit, OneOf, NoneOf
Each of these will consume exactly one character.
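To give an idea of how these primitives generalize the hand-written example above (the definitions in the source may differ in details), Satisfy, Char and Digit could look roughly like this:
// Sketch of primitive parsers: Satisfy consumes one character if it passes the predicate;
// Char and Digit are thin wrappers around it.
public static Parser<char, char> Satisfy(Func<char, bool> predicate)
{
    return consList =>
        !consList.IsEmpty && predicate(consList.Head)
            ? new Result<char, char>(consList.Head, consList.Tail)
            : null;
}

public static Parser<char, char> Char(char c) { return Satisfy(ch => ch == c); }

public static readonly Parser<char, char> Digit = Satisfy(ch => '0' <= ch && ch <= '9');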
The Parser Monad in C#
LINQ, the data querying subset of C#, offers a form of syntactic sugar that allows writing code similar to Haskell do notation. This greatly simplifies the writing of more complex parsers.
For example, let's say we want to write a parser called naturalNum, which reads a sequence of digits and returns an int as its syntax tree. Using parser combinators and primitives from the previous section (i.e., Many1 and Digit), we can define it like this:
Parser<char, int> naturalNum =
consList =>
{
var result = Many1(Digit)(consList);
if (result != null)
return new Result<char, int>(readInt(result.Tree), result.Rest);
else
return null;
};
Because this is such a common pattern, we can define a helper extension method, Select():
public static Parser<TToken, TTree2> Select<TToken, TTree, TTree2>(
this Parser<TToken, TTree> parser,
Func<TTree, TTree2> selector)
{
return consList =>
{
var result = parser(consList);
if (result != null)
return new Result<TToken, TTree2>(selector(result.Tree), result.Rest);
else
return null;
};
}
Now the parser can be written more simply as:
Parser<char, int> naturalNum = Many1(Digit).Select(ds => readInt(ds));
A Select() method with a signature similar to ours has special meaning for LINQ. Taking advantage of that, we can rewrite the parser in syntactic sugar form, which will be translated (desugared) by the C# compiler into exactly the same code as above:
Parser<char, int> naturalNum = from ds in Many1(Digit)
select readInt(ds);
Notice the similarity with do notation in Haskell:
naturalNum = do ds <- many1 digit
return (readInt ds)
So far, we could only use the from keyword once per expression. By also defining a LINQ-related method called SelectMany() (see source), we can build parser expressions that use from more than once. For example, to parse an integer (prefixed by an optional minus sign), we can write:
Parser<char, int> integerNum = from sign in Option('+', Char('-'))
from ds in Many1(Digit)
let s = (sign == '-' ? -1 : 1)
select s * readInt(ds);
This is equivalent to the following Haskell parser:
integerNum = do sign <- option '+' (char '-')
ds <- many1 digit
let s = if sign == '-' then -1 else 1
return (s * (readInt ds))
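For the two-from query above to compile, a SelectMany() with a shape the C# compiler recognizes must be in scope. The actual definition lives in the source; a plausible sketch looks like this:
// Sketch of SelectMany: run 'parser'; feed its tree to 'selector' to obtain the next parser;
// run that parser on the remaining input; then combine both trees with 'projector'.
public static Parser<TToken, TTree3> SelectMany<TToken, TTree, TTree2, TTree3>(
    this Parser<TToken, TTree> parser,
    Func<TTree, Parser<TToken, TTree2>> selector,
    Func<TTree, TTree2, TTree3> projector)
{
    return consList =>
    {
        var result = parser(consList);
        if (result == null)
            return null;

        var result2 = selector(result.Tree)(result.Rest);
        if (result2 == null)
            return null;

        return new Result<TToken, TTree3>(projector(result.Tree, result2.Tree), result2.Rest);
    };
}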
Parsing the Regex Language
Using parser combinators and primitives, as well as the syntactic sugar notation described, we can write a parser for the whole regex language (as supported by RegexParser) in less than 150 lines of code (see source).
For example, the Quantifier
parser, which accepts suffixes <code>*</code>, <code>+</code>, <code>?</code>, <code>{n}</code>, <code>{n,}</code>, <code>{n,m}</code> (greedy quantifiers), and <code>*?</code>, <code>+?</code>, <code>??</code>, <code>{n}?</code>, <code>{n,}?</code>, <code>{n,m}?</code> (lazy quantifiers), is defined like this:
var RangeQuantifierSuffix = Between(Char('{'),
Char('}'),
from min in NaturalNum
from max in
Option(min, PrefixedBy(Char(','),
Option(null, Nullable(NaturalNum))))
select new { Min = min, Max = max });
Quantifier = from child in Atom
from quant in
Choice(
from _q in Char('*') select new { Min = 0, Max = (int?)null },
from _q in Char('+') select new { Min = 1, Max = (int?)null },
from _q in Char('?') select new { Min = 0, Max = (int?)1 },
RangeQuantifierSuffix)
from greedy in
Option(true, from _c in Char('?')
select false)
select (BasePattern)new QuantifierPattern(child, quant.Min, quant.Max, greedy);
More complex parsers are built from simpler ones. The topmost parser is called Regex. The result of parsing is a tree of pattern objects (derived from class BasePattern). Here are the main pattern classes (see sources):
- CharEscapePattern
- CharGroupPattern, CharRangePattern, CharClassSubtractPattern, AnyCharPattern, CaseInsensitiveCharPattern (dealing with character classes)
- GroupPattern
- QuantifierPattern
- AlternationPattern
- AnchorPattern
All pattern classes are immutable.
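To illustrate what that immutability looks like in practice, here is a simplified, hypothetical sketch along the lines of QuantifierPattern (the real class derives from BasePattern and carries more behavior; the property names follow the usage shown elsewhere in this document, except IsGreedy, which is an assumed name):
// Simplified, hypothetical sketch of an immutable pattern class.
public sealed class QuantifierPattern
{
    public QuantifierPattern(BasePattern childPattern, int minOccurrences, int? maxOccurrences, bool isGreedy)
    {
        ChildPattern = childPattern;
        MinOccurrences = minOccurrences;
        MaxOccurrences = maxOccurrences;   // null means "unbounded"
        IsGreedy = isGreedy;               // assumed property name
    }

    // All state is set once in the constructor and never mutated afterwards.
    public BasePattern ChildPattern { get; private set; }
    public int MinOccurrences { get; private set; }
    public int? MaxOccurrences { get; private set; }
    public bool IsGreedy { get; private set; }
}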
Phase 2: Transforming the Abstract Syntax Tree
The following transforms are performed (see sources):
- BaseASTTransform: Remove empty groups; replace non-capturing groups that have a single child pattern with the pattern itself.
- QuantifierASTTransform: Split quantifiers into their deterministic and non-deterministic parts. For example, the QuantifierPattern representing <code>a{2,5}</code> will be split into two patterns, equivalent to <code>a{2}a{0,3}</code>. The second pattern is fully non-deterministic, which means that backtracking can and will be used at every step (see the sketch after this list). This transform also cleans up corner cases: quantifiers with empty child patterns, etc.
- RegexOptionsASTTransform: Implement the global regex options IgnoreCase, Multiline and Singleline by transforming CharPattern and AnchorPattern objects.
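As a sketch of the arithmetic behind the quantifier split (independent of the actual pattern classes, which this snippet does not use), the deterministic and non-deterministic halves of a {min,max} quantifier can be computed like this:
// Sketch only: compute the split used by QuantifierASTTransform in terms of plain numbers.
// A quantifier {min,max} becomes a deterministic part {min,min} followed by a fully
// non-deterministic part {0,max-min}; a null max means "unbounded".
static void SplitQuantifier(int min, int? max,
                            out int deterministicCount,
                            out int nonDetMin, out int? nonDetMax)
{
    deterministicCount = min;                                   // e.g. a{2,5} -> a{2} ...
    nonDetMin = 0;
    nonDetMax = max.HasValue ? max.Value - min : (int?)null;    // ... followed by a{0,3}
}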
Phase 3: Pattern Matching on the Target String
Matching without Backtracking
The simplest way to parse the target string is to build a parser from the AST, using combinators we already have. This, however, would be a non-backtracking parser, as our Parser type does not allow returning multiple "success" alternatives.
Here is a recursive definition (see source), based on the Sequence, Count, Choice, and Satisfy combinators:
private Parser<char, string> createParser(BasePattern pattern)
{
if (pattern == null)
throw new ArgumentNullException("pattern", "Pattern is null when creating match parser.");
switch (pattern.Type)
{
case PatternType.Group:
return from vs in
CharParsers.Sequence(((GroupPattern)pattern).Patterns
.Select(p => createParser(p)))
select vs.JoinStrings();
case PatternType.Quantifier:
QuantifierPattern quant = (QuantifierPattern)pattern;
return from vs in CharParsers.Count(quant.MinOccurrences,
quant.MaxOccurrences,
createParser(quant.ChildPattern))
select vs.JoinStrings();
case PatternType.Alternation:
return CharParsers.Choice(((AlternationPattern)pattern).Alternatives
.Select(p => createParser(p))
.ToArray());
case PatternType.Char:
return from c in CharParsers.Satisfy(((CharPattern)pattern).IsMatch)
select new string(c, 1);
default:
throw new ApplicationException(
string.Format("ExplicitDFAMatcher: unrecognized pattern type ({0}).",
pattern.GetType().Name));
}
}
As a further drawback, this parser does not (and cannot) deal with anchor patterns.
Matching with Backtracking
To understand the need for backtracking, let's consider a simple example: we want to find all the words that end with <code>t</code> within the target <code>"a lot of important text"</code>.
We start with the most logical pattern: <code>\w+t</code> (any word character, repeated one or more times, followed by <code>t</code>). This doesn't work as expected: <code>\w+</code> will match <code>lot</code> (including the final <code>t</code>), so the <code>t</code> in the pattern won't have anything left to match.
We then try the pattern <code>[\w-[t]]+t</code> (any word character except <code>t</code>, repeated one or more times, followed by <code>t</code>). This will match <code>lot</code> correctly, but then it will match <code>import</code>, <code>ant</code>, and <code>ext</code>. Not correct.
Next we try <code>\w+?t</code> (any word character, repeated one or more times lazily, followed by <code>t</code>). This produces the same result as above. (Not to mention the fact that lazy quantifiers actually need backtracking in order to work.)
What would work is the following: suppose that <code>\w+</code> matches every word character including the <code>t</code> (in <code>lot</code>); then, when the parser notices that the next part of the pattern (i.e., <code>t</code>) does not match, it tries to backtrack one match of the quantifier's subpattern (one word character); so now it has matched only <code>lo</code>, and the <code>t</code> in the pattern will match.
Possible complications:
- Backtracking might need to go back thousands of matches (all of which need to be kept track of in a stack, like a trail of breadcrumbs).
- The moment of "mismatch" may arrive long after the end of the non-deterministic pattern (instead of right after it, as in our example), thousands of characters away and in a different part of the pattern tree. Jumping back will need to restore the whole context as of just before the match, including the location in the pattern tree and the location in the target string.
See the implementation here.
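To make the mechanism concrete, here is a minimal, self-contained sketch (not the RegexParser implementation) of matching a <code>\w+t</code>-style pattern at a given position, where the greedy quantifier first consumes as much as it can and then gives characters back one at a time:
// Sketch only: greedy matching with backtracking for the pattern "\w+ followed by 't'".
// char.IsLetterOrDigit is used as a rough approximation of \w.
static int MatchWordCharsThenT(string target, int start)
{
    // Greedy phase: let \w+ consume as many word characters as possible.
    int consumed = start;
    while (consumed < target.Length && char.IsLetterOrDigit(target[consumed]))
        consumed++;

    // Backtracking phase: give characters back one at a time until the literal 't'
    // matches at the current position, keeping at least one character for \w+.
    for (int pos = consumed; pos > start; pos--)
        if (target[pos - 1] == 't' && pos - 1 > start)
            return pos;          // the overall match spans target[start .. pos)

    return -1;                   // no match starting at 'start'
}
Applied at the start of <code>lot</code>, the greedy phase consumes <code>lot</code>, the first backtracking step gives back the final <code>t</code>, and the literal <code>t</code> then matches, yielding the expected result.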
Testing
The main unit tests are matching tests for the supported regex patterns: char escapes, char classes, quantifiers (greedy/lazy), alternations, anchors, and global regex options (see sources).
The tests make use of RegexAssert.AreMatchesSameAsMsoft() (see source) as the main way of asserting correctness. This method compares the result returned by RegexParser (the whole collection of matches, including positions in the target string) with the result of the standard .NET implementation (System.Text.RegularExpressions.Regex.Matches()).
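A typical test therefore asserts that RegexParser and the .NET engine agree on a given pattern and target. The exact parameter list is not shown here, so the call below is only a guessed illustration:
// Hypothetical usage; the actual signature of AreMatchesSameAsMsoft may differ.
RegexAssert.AreMatchesSameAsMsoft("a lot of important text", @"\w+t");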
There are also unit tests for AST transforms and pattern classes.
Performance
Performance is about one order of magnitude slower than the standard .NET implementation, due in no small part to running on a virtual machine, the .NET Runtime (whereas the Microsoft version is closer to the machine, being written in C).
However, care has been taken to ensure that the performance profile is similar to the standard .NET implementation. So, if the .NET implementation runs in O(n) time, the RegexParser version will also run in O(n) time.
There are a number of "informal" performance tests, run from the console, with performance figures attached to code as comments. For example (see source):
const int times = 1, inputLength = 1000000; // or 10000000
string alphabet = "abcdefghijklmnopqrstuvwxyz";
string lowercaseChars = EnumerablePerformanceTests.RepeatChars(alphabet, inputLength)
.AsString();
testRegexMatches(lowercaseChars, @"(\w{1000})+", times);
// 0.857 sec. (inputLength = 1,000,000)
// 9.044 sec. (inputLength = 10,000,000)
See more performance tests here.
Implemented Regex Features
RegexParser is a fairly complete regex engine. The following constructs have been implemented:
- Character escapes:
  - any character except for one of <code>.$^{[(|)*+?\</code> matches itself
  - <code>\n</code>: new line
  - <code>\r</code>: carriage return
  - <code>\t</code>: tab
  - <code>\b</code>: backspace (only inside a character class)
  - <code><strong>\</strong>nnn</code>: ASCII character, where nnn is a two- or three-digit octal character code
  - <code><strong>\x</strong>nn</code>: ASCII character, where nn is a two-digit hexadecimal character code
  - <code><strong>\u</strong>nnnn</code>: UTF-16 code unit whose value is nnnn hexadecimal
  - <code>\</code> followed by a character not recognized as escaped matches that character
- Character classes:
  - <code>.</code> matches any character except <code>\n</code>
  - positive character groups (e.g., <code>[aeiou]</code>, <code>[a-zA-Z]</code>, <code>[abcA-H\d\n]</code>)
  - negative character groups (e.g., <code>[^a-zA-Z]</code>)
  - named character classes:
    - <code>\w</code>: a word character; same as <code>[0-9A-Z_a-z]</code>
    - <code>\W</code>: a non-word character; same as <code>[^0-9A-Z_a-z]</code>
    - <code>\s</code>: a whitespace character; same as <code>[ \n\r\t]</code>
    - <code>\S</code>: a non-whitespace character; same as <code>[^ \n\r\t]</code>
    - <code>\d</code>: a digit character; same as <code>[0-9]</code>
    - <code>\D</code>: a non-digit character; same as <code>[^0-9]</code>
  - character class subtraction (e.g., <code>[0-9-[246]]</code> matches any digit except for 2, 4 and 6)
- Grouping (without capturing): <code>(subexpr)</code>
- Quantifiers:
- Greedy: <code>*</code>, <code>+</code>, <code>?</code>, <code>{n}</code>, <code>{n,}</code>, <code>{n,m}</code>
- Lazy: <code>*?</code>, <code>+?</code>, <code>??</code>, <code>{n}?</code>, <code>{n,}?</code>, <code>{n,m}?</code>
- Alternation: <code>|</code>
- Anchors:
  - <code>^</code>: start of string or line (depending on the Multiline option)
  - <code>$</code>: end of string or line (depending on the Multiline option)
  - <code>\A</code>: start of string only
  - <code>\Z</code>: end of string or before the ending newline
  - <code>\z</code>: end of string only
  - <code>\G</code>: contiguous match (must start where the previous match ended)
  - <code>\b</code>: word boundary
  - <code>\B</code>: non-word boundary
- Regex options (global):
  - IgnoreCase
  - Multiline: <code>^</code> and <code>$</code> will match at the beginning and end of each line (instead of the beginning and end of the input string)
  - Singleline: the period (<code>.</code>) will match every character (instead of every character except <code>\n</code>)
Missing Features
Still on the TODO list:
- Group related:
  - capturing groups
  - backreferences
  - named groups
  - atomic groups (non-backtracking)
- Substitution:
  - the Regex.Replace() method
  - substitution patterns: <code>$$</code>, <code>$1</code>, etc.
- Look-ahead
- Look-behind
- Regex options:
  - IgnorePatternWhitespace
  - ExplicitCapture
  - inline options (as opposed to global)
- Comments in patterns: <code>(?#comment)</code>