Home

Awesome

Pidgin

A lightweight, fast, and flexible parsing library for C#.

Installing

Pidgin is available on Nuget. API docs are hosted on my website.

Tutorial

There's a tutorial on using Pidgin to parse a subset of Prolog on my website.

Getting started

Pidgin is a parser combinator library, a lightweight, high-level, declarative tool for constructing parsers. Parsers written with parser combinators look like a high-level specification of a language's grammar, but they're expressed within a general-purpose programming language and require no special tools to produce executable code. Parser combinators are more powerful than regular expressions - they can parse a larger class of languages - but simpler and easier to use than parser generators like ANTLR.

Pidgin's core type, Parser<TToken, T>, represents a procedure which consumes an input stream of TTokens, and may either fail with a parsing error or produce a T as output. You can think of it as:

delegate T? Parser<TToken, T>(IEnumerator<TToken> input);

In order to start building parsers we need to import two classes which contain factory methods: Parser and Parser<TToken>.

using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;  // we'll be parsing strings - sequences of characters. For other applications (eg parsing binary file formats) TToken may be some other type (eg byte).

Primitive parsers

Now we can create some simple parsers. Any represents a parser which consumes a single character and returns that character.

Assert.AreEqual('a', Any.ParseOrThrow("a"));
Assert.AreEqual('b', Any.ParseOrThrow("b"));

Char, an alias for Token, consumes a particular character and returns that character. If it encounters some other character then it fails.

Parser<char, char> parser = Char('a');
Assert.AreEqual('a', parser.ParseOrThrow("a"));
Assert.Throws<ParseException>(() => parser.ParseOrThrow("b"));

Digit parses and returns a single digit character.

Assert.AreEqual('3', Digit.ParseOrThrow("3"));
Assert.Throws<ParseException>(() => Digit.ParseOrThrow("a"));

String parses and returns a particular string. If you give it input other than the string it was expecting it fails.

Parser<char, string> parser = String("foo");
Assert.AreEqual("foo", parser.ParseOrThrow("foo"));
Assert.Throws<ParseException>(() => parser.ParseOrThrow("bar"));

Return (and its synonym FromResult) never consumes any input, and just returns the given value. Likewise, Fail always fails without consuming any input.

Parser<char, int> parser = Return(3);
Assert.AreEqual(3, parser.ParseOrThrow("foo"));

Parser<char, int> parser2 = Fail<int>();
Assert.Throws<ParseException>(() => parser2.ParseOrThrow("bar"));

Sequencing parsers

The power of parser combinators is that you can build big parsers out of little ones. The simplest way to do this is using Then, which builds a new parser representing two parsers applied sequentially (discarding the result of the first).

Parser<char, string> parser1 = String("foo");
Parser<char, string> parser2 = String("bar");
Parser<char, string> sequencedParser = parser1.Then(parser2);
Assert.AreEqual("bar", sequencedParser.ParseOrThrow("foobar"));  // "foo" got thrown away
Assert.Throws<ParseException>(() => sequencedParser.ParseOrThrow("food"));

Before throws away the second result, not the first.

Parser<char, string> parser1 = String("foo");
Parser<char, string> parser2 = String("bar");
Parser<char, string> sequencedParser = parser1.Before(parser2);
Assert.AreEqual("foo", sequencedParser.ParseOrThrow("foobar"));  // "bar" got thrown away
Assert.Throws<ParseException>(() => sequencedParser.ParseOrThrow("food"));

Map does a similar job, except it keeps both results and applies a transformation function to them. This is especially useful when you want your parser to return a custom data structure. (Map has overloads which operate on between one and eight parsers; the one-parser version also has a postfix synonym Select.)

Parser<char, string> parser1 = String("foo");
Parser<char, string> parser2 = String("bar");
Parser<char, string> sequencedParser = Map((foo, bar) => bar + foo, parser1, parser2);
Assert.AreEqual("barfoo", sequencedParser.ParseOrThrow("foobar"));
Assert.Throws<ParseException>(() => sequencedParser.ParseOrThrow("food"));

Bind uses the result of a parser to choose the next parser. This enables parsing of context-sensitive languages. For example, here's a parser which parses any character repeated twice.

/// parse any character, then parse a character matching the first character
Parser<char, char> parser = Any.Bind(c => Char(c));
Assert.AreEqual('a', parser.ParseOrThrow("aa"));
Assert.AreEqual('b', parser.ParseOrThrow("bb"));
Assert.Throws<ParseException>(() => parser.ParseOrThrow("ab"));

Pidgin parsers support LINQ query syntax. It may be easier to see what the above example does when it's written out using LINQ:

Parser<char, char> parser =
    from c in Any
    from c2 in Char(c)
    select c2;

Parsers written like this look like a simple imperative script. "Run the Any parser and name its result c, then run Char(c) and name its result c2, then return c2."

Choosing from alternatives

Or represents a parser which can parse one of two alternatives. It runs the left parser first, and if it fails it tries the right parser.

Parser<char, string> parser = String("foo").Or(String("bar"));
Assert.AreEqual("foo", parser.ParseOrThrow("foo"));
Assert.AreEqual("bar", parser.ParseOrThrow("bar"));
Assert.Throws<ParseException>(() => parser.ParseOrThrow("baz"));

OneOf is equivalent to Or, except it takes a variable number of arguments. Here's a parser which is equivalent to the one using Or above:

Parser<char, string> parser = OneOf(String("foo"), String("bar"));

If one of Or or OneOf's component parsers fails after consuming input, the whole parser will fail.

Parser<char, string> parser = String("food").Or(String("foul"));
Assert.Throws<ParseException>(() => parser.ParseOrThrow("foul"));  // why didn't it choose the second option?

What happened here? When a parser successfully parses a character from the input stream, it advances the input stream to the next character. Or only chooses the next alternative if the given parser fails without consuming any input; Pidgin does not perform any lookahead or backtracking by default. Backtracking is enabled using the Try function.

// apply Try to the first option, so we can return to the beginning if it fails
Parser<char, string> parser = Try(String("food")).Or(String("foul"));
Assert.AreEqual("foul", parser.ParseOrThrow("foul"));

Recursive grammars

Almost any non-trivial programming language, markup language, or data interchange language will feature some sort of recursive structure. C# doesn't support recursive values: a recursive referral to a variable currently being initialised will return null. So we need some sort of deferred execution of recursive parsers, which Pidgin enables using the Rec combinator. Here's a simple parser which parses arbitrarily nested parentheses with a single digit inside them.

Parser<char, char> expr = null;
Parser<char, char> parenthesised = Char('(')
    .Then(Rec(() => expr))  // using a lambda to (mutually) recursively refer to expr
    .Before(Char(')'));
expr = Digit.Or(parenthesised);
Assert.AreEqual('1', expr.ParseOrThrow("1"));
Assert.AreEqual('1', expr.ParseOrThrow("(1)"));
Assert.AreEqual('1', expr.ParseOrThrow("(((1)))"));

However, Pidgin does not support left recursion. A parser must consume some input before making a recursive call. The following example will produce a stack overflow because a recursive call to arithmetic occurs before any input can be consumed by Digit or Char('+'):

Parser<char, int> arithmetic = null;
Parser<char, int> addExpr = Map(
    (x, _, y) => x + y,
    Rec(() => arithmetic),
    Char('+'),
    Rec(() => arithmetic)
);
arithmetic = addExpr.Or(Digit.Select(d => (int)char.GetNumericValue(d)));

arithmetic.Parse("2+2");  // stack overflow!

Derived combinators

Another powerful element of this programming model is that you can write your own functions to compose parsers. Pidgin contains a large number of higher-level combinators, built from the primitives outlined above. For example, Between runs a parser surrounded by two others, keeping only the result of the central parser.

Parser<TToken, T> InBraces<TToken, T, U, V>(this Parser<TToken, T> parser, Parser<TToken, U> before, Parser<TToken, V> after)
    => before.Then(parser).Before(after);

Parsing expressions

Pidgin features operator-precedence parsing tools, for parsing expression grammars with associative infix operators. The ExpressionParser class builds a parser from a parser to parse a single expression term and a table of operators with rules to combine expressions.

More examples

Examples, such as parsing (a subset of) JSON and XML into document structures, can be found in the Pidgin.Examples project.

Tips

A note on variance

Why doesn't this code compile?

class Base {}
class Derived : Base {}

Parser<char, Base> p = Return(new Derived());  // Cannot implicitly convert type 'Pidgin.Parser<char, Derived>' to 'Pidgin.Parser<char, Base>'

This would be possible if Parser were defined as a covariant in its second type parameter (ie interface Parser<TToken, out T>). For the purposes of efficiency, Pidgin parsers return a struct. Structs and classes aren't allowed to have variant type parameters (only interfaces and delegates); since a Pidgin parser's return value isn't variant, nor can the parser itself.

In my experience, this crops up most frequently when returning a node of a syntax tree from a parser using Select. The least verbose way of rectifying this is to explicitly set Select's type parameter to the supertype:

Parser<char, Base> p = Any.Select<Base>(() => new Derived());

Speed tips

Pidgin is designed to be fast and produce a minimum of garbage. A carefully written Pidgin parser can be competitive with a hand-written recursive descent parser. If you find that parsing is a bottleneck in your code, here are some tips for minimising the runtime of your parser.

Comparison to other tools

Pidgin vs Sprache

Sprache is another parser combinator library for C# and served as one of the sources of inspiration for Pidgin. Pidgin's API is somewhat similar to that of Sprache, but Pidgin aims to improve on Sprache in a number of ways:

Pidgin vs FParsec

FParsec is a parser combinator library for F# based on Parsec.

Benchmarks

This is how Pidgin compares to other tools in terms of performance. The benches can be found in the Pidgin.Bench project.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.14393.3384 (1607/AnniversaryUpdate/Redstone1)
Intel Core i5-4460 CPU 3.20GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
Frequency=3125000 Hz, Resolution=320.0000 ns, Timer=TSC
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT

ExpressionBench

MethodMeanErrorStdDevRatioRatioSDGen 0Gen 1Gen 2Allocated
LongInfixL_Pidgin625,148.8 ns3,015.040 ns2,672.7541 ns2.250.01---128 B
LongInfixR_Pidgin625,530.1 ns4,104.833 ns3,839.6633 ns2.250.02---128 B
LongInfixL_FParsec278,035.1 ns1,231.538 ns1,151.9816 ns1.000.00---200 B
LongInfixR_FParsec326,047.3 ns931.485 ns871.3119 ns1.170.01---200 B
ShortInfixL_Pidgin1,506.5 ns5.515 ns5.1590 ns2.670.010.0401--128 B
ShortInfixR_Pidgin1,636.6 ns6.882 ns5.7467 ns2.900.020.0401--128 B
ShortInfixL_FParsec564.1 ns1.894 ns1.6788 ns1.000.000.0629--200 B
ShortInfixR_FParsec567.7 ns1.200 ns0.9373 ns1.010.000.0629--200 B

JsonBench

MethodMeanErrorStdDevRatioRatioSDGen 0Gen 1Gen 2Allocated
BigJson_Pidgin684.6 us2.888 us2.701 us1.000.0033.2031--101.7 KB
BigJson_Sprache3,597.5 us17.595 us16.458 us5.250.031726.5625--5291.81 KB
BigJson_Superpower2,884.4 us6.504 us5.766 us4.210.02296.8750--913.43 KB
BigJson_FParsec750.1 us3.516 us3.289 us1.100.01111.3281--343.43 KB
LongJson_Pidgin517.5 us2.418 us2.261 us1.000.0033.2031--104.25 KB
LongJson_Sprache2,858.5 us10.491 us9.300 us5.530.031390.6250--4269.33 KB
LongJson_Superpower2,348.1 us14.194 us13.277 us4.540.03230.4688--706.79 KB
LongJson_FParsec642.5 us2.708 us2.533 us1.240.01125.0000--384.3 KB
DeepJson_Pidgin399.3 us1.784 us1.582 us1.000.0026.3672--82.24 KB
DeepJson_Sprache2,983.0 us42.512 us39.765 us7.460.09761.7188191.4063-2922.46 KB
DeepJson_FParsec701.8 us1.665 us1.557 us1.760.01112.3047--344.43 KB
WideJson_Pidgin427.8 us1.619 us1.515 us1.000.0015.6250--48.42 KB
WideJson_Sprache1,704.2 us9.246 us8.196 us3.980.02900.3906--2763.22 KB
WideJson_Superpower1,494.6 us9.581 us8.962 us3.490.02148.4375--459.74 KB
WideJson_FParsec379.5 us1.597 us1.494 us0.890.0041.9922--129.02 KB