CoreParse

CoreParse is a parsing library for Mac OS X and iOS. It supports a wide range of grammars thanks to its shift/reduce parsing schemes. Currently CoreParse supports SLR, LR(1) and LALR(1) parsers.

For full documentation see http://beelsebob.github.com/CoreParse.

Why Should You use CoreParse

You may wonder why, or when, you should use CoreParse. A number of parsers are already available in the wild, so why should you use this one?

Where is CoreParse Already Used?

CoreParse is already used in a major way in at least two projects:

If you know of any other places it's been used, please feel free to get in touch.

Parsing Guide

CoreParse is a powerful framework for tokenising and parsing. This document explains how to create a tokeniser and parser from scratch, and how to use those parsers to create your model data structures for you. We will follow one example throughout: parsing a simple numerical expression and computing its result.

gavineadie has implemented this entire example. For the full working source, see https://github.com/beelsebob/ParseTest/.

Tokenisation

CoreParse's tokenisation class is CPTokeniser. To specify how tokens are constructed, you add token recognisers to the tokeniser in order of precedence.

Our example language will involve several symbols, numbers, whitespace, and comments. We add these to the tokeniser:

CPTokeniser *tokeniser = [[[CPTokeniser alloc] init] autorelease];
[tokeniser addTokenRecogniser:[CPNumberRecogniser numberRecogniser]];
[tokeniser addTokenRecogniser:[CPWhiteSpaceRecogniser whiteSpaceRecogniser]];
[tokeniser addTokenRecogniser:[CPQuotedRecogniser quotedRecogniserWithStartQuote:@"/*" endQuote:@"*/" name:@"Comment"]];
[tokeniser addTokenRecogniser:[CPKeywordRecogniser recogniserForKeyword:@"+"]];
[tokeniser addTokenRecogniser:[CPKeywordRecogniser recogniserForKeyword:@"-"]];
[tokeniser addTokenRecogniser:[CPKeywordRecogniser recogniserForKeyword:@"*"]];
[tokeniser addTokenRecogniser:[CPKeywordRecogniser recogniserForKeyword:@"/"]];
[tokeniser addTokenRecogniser:[CPKeywordRecogniser recogniserForKeyword:@"("]];
[tokeniser addTokenRecogniser:[CPKeywordRecogniser recogniserForKeyword:@")"]];

Note that the comment recogniser is added before the keyword recogniser for the divide symbol. This gives it higher precedence, and means that the first slash of a comment will not be recognised as a division.

Next, we set ourselves as the tokeniser's delegate. We implement the tokeniser delegate methods in such a way that whitespace tokens and comments, although consumed, do not appear in the tokeniser's output:

- (BOOL)tokeniser:(CPTokeniser *)tokeniser shouldConsumeToken:(CPToken *)token
{
    return YES;
}

- (void)tokeniser:(CPTokeniser *)tokeniser requestsToken:(CPToken *)token pushedOntoStream:(CPTokenStream *)stream
{
    if (![token isWhiteSpaceToken] && ![[token name] isEqualToString:@"Comment"])
    {
        [stream pushToken:token];
    }
}
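The delegate methods above only take effect once the tokeniser has been told about its delegate. A minimal sketch, assuming CPTokeniser exposes a conventional delegate property with a setDelegate: setter (this call is not shown in the original example):

```objc
// Assumption: `self` implements the two tokeniser delegate methods above,
// and CPTokeniser provides a standard -setDelegate: setter.
[tokeniser setDelegate:self];
```

This should be done before tokenise: is called, otherwise whitespace and comment tokens would not be filtered by the methods above.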

We can now invoke our tokeniser.

CPTokenStream *tokenStream = [tokeniser tokenise:@"5 + (2.0 / 5.0 + 9) * 8"];

Parsing

We construct parsers by specifying their grammar, which can be written in a simple BNF-like language. The syntax tag@<NonTerminal> can be read simply as <NonTerminal>; the tag can be used later to quickly extract values from the parsed result:

NSString *expressionGrammar =
    @"Expression ::= term@<Term>   | expr@<Expression> op@<AddOp> term@<Term>;"
    @"Term       ::= fact@<Factor> | fact@<Factor>     op@<MulOp> term@<Term>;"
    @"Factor     ::= num@'Number' | '(' expr@<Expression> ')';"
    @"AddOp      ::= '+' | '-';"
    @"MulOp      ::= '*' | '/';";
NSError *err = nil;
CPGrammar *grammar = [CPGrammar grammarWithStart:@"Expression" backusNaurForm:expressionGrammar error:&err];
if (nil == grammar)
{
    NSLog(@"Error creating grammar:");
    NSLog(@"%@", err);
}
else
{
    CPParser *parser = [CPLALR1Parser parserWithGrammar:grammar];
    [parser setDelegate:self];
    ...
}

When the parser matches a rule, it calls the initWithSyntaxTree: method on a new instance of the appropriate class. If no such class exists, the parser delegate's parser:didProduceSyntaxTree: method is called instead. To deal with this cleanly, we implement three classes: Expression, Term and Factor. The AddOp and MulOp non-terminals are dealt with by the parser delegate. Here is the initWithSyntaxTree: method for the Expression class; the methods for Term and Factor are similar:

- (id)initWithSyntaxTree:(CPSyntaxTree *)syntaxTree
{
    self = [self init];
    
    if (nil != self)
    {
        Term       *t = [syntaxTree valueForTag:@"term"];
        Expression *e = [syntaxTree valueForTag:@"expr"];
        
        if (nil == e)
        {
            [self setValue:[t value]];
        }
        else if ([[syntaxTree valueForTag:@"op"] isEqualToString:@"+"])
        {
            [self setValue:[e value] + [t value]];
        }
        else
        {
            [self setValue:[e value] - [t value]];
        }
    }
    
    return self;
}
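For comparison, the Factor class might look something like the following. This is only a sketch under stated assumptions: that the value tagged num is a CPNumberToken whose number method returns an NSNumber, and that Factor has a value property of type double like Expression. The working source linked above is the authoritative version.

```objc
// Sketch of Factor's initialiser for the rule
//   Factor ::= num@'Number' | '(' expr@<Expression> ')';
// Assumptions: the "num" tag yields a CPNumberToken with a -number accessor,
// and -setValue: takes a double, mirroring the Expression class above.
- (id)initWithSyntaxTree:(CPSyntaxTree *)syntaxTree
{
    self = [self init];

    if (nil != self)
    {
        CPNumberToken *n = [syntaxTree valueForTag:@"num"];
        Expression    *e = [syntaxTree valueForTag:@"expr"];

        if (nil != n)
        {
            // First alternative: a bare number.
            [self setValue:[[n number] doubleValue]];
        }
        else
        {
            // Second alternative: a parenthesised expression.
            [self setValue:[e value]];
        }
    }

    return self;
}
```

Term would follow the Expression pattern more closely, reading the fact, op and term tags and applying * or / instead of + or -.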

We must also implement the delegate's method for dealing with AddOps and MulOps:

- (id)parser:(CPParser *)parser didProduceSyntaxTree:(CPSyntaxTree *)syntaxTree
{
    return [(CPKeywordToken *)[syntaxTree childAtIndex:0] keyword];
}

We can now parse the token stream we produced earlier:

NSLog(@"%f", [(Expression *)[parser parse:tokenStream] value]);

Which outputs:

80.2

Best Practices

CoreParse offers three types of parser: SLR, LR(1) and LALR(1).

It is recommended that you start with an SLR parser (unless you know better) and, when a parser cannot be generated for your grammar, move on to an LALR(1) parser. LR(1) parsers are not really recommended at all, though they may be useful in extreme circumstances.
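The recommendation above could be expressed as a small fallback helper. This is a sketch, not part of CoreParse: BuildParser is a hypothetical name, and it assumes that parserWithGrammar: returns nil when a parser of that type cannot be generated for the grammar.

```objc
#import <CoreParse/CoreParse.h>

// Hypothetical helper, not provided by CoreParse.
// Assumption: +parserWithGrammar: returns nil when no parser of the
// requested type can be generated for the grammar.
static CPParser *BuildParser(CPGrammar *grammar)
{
    // Try the SLR parser first, as recommended.
    CPParser *parser = [CPSLRParser parserWithGrammar:grammar];

    if (nil == parser)
    {
        // Fall back to LALR(1) if SLR generation failed.
        parser = [CPLALR1Parser parserWithGrammar:grammar];
    }

    return parser;
}
```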

If you have a significant grammar that requires an LALR(1) parser, it is recommended that you use NSKeyedArchiver to archive the parser to a file. Your application can then unarchive that file when it runs, rather than generating the parser every time, as parser generation can take some time.
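One possible shape for this caching is sketched below. LoadOrBuildParser and the path argument are hypothetical names chosen for illustration, and the sketch assumes the parser can be round-tripped through NSKeyedArchiver as described above. The archiving calls used are the classic file-based Foundation methods, in keeping with the manual memory management style of the rest of this guide.

```objc
#import <Foundation/Foundation.h>
#import <CoreParse/CoreParse.h>

// Hypothetical helper, not provided by CoreParse.
// Returns a previously archived parser from `path` if one can be read,
// otherwise generates an LALR(1) parser and archives it for the next run.
static CPParser *LoadOrBuildParser(CPGrammar *grammar, NSString *path)
{
    // Returns nil if the file does not exist.
    CPParser *parser = [NSKeyedUnarchiver unarchiveObjectWithFile:path];

    if (nil == parser)
    {
        parser = [CPLALR1Parser parserWithGrammar:grammar];

        if (nil != parser)
        {
            // Save the generated parser so later launches skip generation.
            [NSKeyedArchiver archiveRootObject:parser toFile:path];
        }
    }

    return parser;
}
```

It is probably safest to set the parser's delegate again after unarchiving, since this sketch does not assume the delegate is stored in the archive. If the grammar changes, the archived file should be deleted or regenerated so that a stale parser is not loaded.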