BRex for better Regexing!

Licensed under the MIT License. PRs welcome.

This project is a reimagining of regular language and other structured text processing tools!

Practical experience with regular expressions, and empirical results from academic research including "Regexes are Hard: Decision-making, Difficulties, and Risks in Programming Regular Expressions" and "Exploring Regular Expression Usage and Context in Python", have shown that classic PCRE-style regular expression languages have maintainability, readability, and correctness issues. These challenges are particularly acute in the context of the Bosque Object Notation language (BSQON) project, where regular expressions are used extensively to describe and validate rich datatypes.

The BRex regex language is designed to be a more maintainable, readable, and correct alternative to classic PCRE-style regular expressions, and to support novel features that are useful when describing and validating data with a regular language.

Goals

Specific goals for the BRex language include:

Overall Design

BRex introduces new text and path languages plus core infrastructure components for structured text processing. The goal is to provide a core native API, for embedding into other applications, that operates on byte buffers and provides a uniform interface for operating on strings with regular expressions. On top of this core API the project exposes a Node.js native module (TODO) and a [command line tool](docs/brex_cmd.md), brex, that provides a simple AWK-like interface for using BRex expressions to process text files. Finally, the project plans to leverage improvements in the expression semantics to create improved (and novel) tools for working with these languages (see the issue tracker).

Unicode support is a foundational part of the BRex design and implementation. As such, the BRex language supports the full Unicode character set and is fully UTF-8 aware. However, in many cases simplicity is desired, so BRex also provides an explicit ASCII-only regex form and processing pipeline.

The matching engine is based on an NFA simulation to ensure that average-case performance is efficient and that worst-case performance is not pathological (ReDoS). We also restrict the regex forms used in searching so that we can always use a fast string search algorithm (e.g. Boyer-Moore) to quickly scan for start/end positions. This ensures that BRex can be used in a wide range of applications, particularly data validation, without severe performance risks.
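To make the NFA simulation idea concrete, here is a minimal sketch of lockstep state-set simulation. It is not the BRex engine: the `Edge` and `Nfa` types and the `nfaTest` and `makeHVowels` functions are hypothetical names invented for this illustration, and the automaton is hand-built for an ASCII "h followed by one or more vowels" pattern rather than compiled from an expression.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One labelled edge of the automaton: from --(accepts(c))--> to.
struct Edge {
    std::size_t from;
    std::size_t to;
    std::function<bool(char)> accepts;
};

// A tiny epsilon-free NFA (hypothetical type, not part of the BRex API).
struct Nfa {
    std::size_t stateCount;
    std::size_t start;
    std::vector<bool> accepting;
    std::vector<Edge> edges;
};

// Advance every active state in lockstep, one input character at a time.
// The cost is O(input length * edge count) and there is no backtracking,
// so no input can trigger exponential behavior.
bool nfaTest(const Nfa& nfa, const std::string& input) {
    std::vector<bool> current(nfa.stateCount, false);
    current[nfa.start] = true;
    for (char c : input) {
        std::vector<bool> next(nfa.stateCount, false);
        for (const Edge& e : nfa.edges) {
            if (current[e.from] && e.accepts(c)) {
                next[e.to] = true;
            }
        }
        current = std::move(next);
    }
    for (std::size_t s = 0; s < nfa.stateCount; ++s) {
        if (current[s] && nfa.accepting[s]) {
            return true;
        }
    }
    return false;
}

// Hand-built automaton for "h" followed by one or more ASCII vowels.
Nfa makeHVowels() {
    auto isH = [](char c) { return c == 'h'; };
    auto isVowel = [](char c) {
        return std::string("aeiou").find(c) != std::string::npos;
    };
    return Nfa{3, 0, {false, false, true},
               {{0, 1, isH}, {1, 2, isVowel}, {2, 2, isVowel}}};
}
```

Calling `nfaTest(makeHVowels(), "haaa")` accepts, while `"h"` and `"hax"` are rejected. A backtracking engine can revisit the same input position many times; this style of simulation visits each character once.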

Finally, BRex includes a number of novel language features that extend classic regular expression languages with features that are useful in the context of data validation and specification. These include named patterns, conjunction, negation, and an explicit URI path language (BPath). These features, combined with various ergonomic improvements, make BRex a powerful and expressive language for working with regular expressions and structured textual data more generally!

Notable Features

BRex includes a number of distinct features that are not present in classic PCRE style regular expressions. These include:

Example Expressions

A simple regex -- letter h followed by one or more vowels

In BRex this is expressed as follows, and defaults to a Unicode regex:

/"h"[aeiou]+/

We can also specify that it matches only ASCII printable and whitespace characters (using a single-quoted ASCII literal and the c flag):

/'h'[aeiou]+/c

Comments and line breaks are fine too (note whitespace is ignored outside of literals and ranges):

/
  "h"      %% start with h
  [aeiou]+ %% followed by one or more vowels
/

Using unicode and escapes

We can use unicode directly in the regex:

/"🌶" %*unicode pepper*%/

Or we can use hex escapes:

/"%x1f335; %x59;" %*unicode 🌵 and Y*%/
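Because the core API operates on byte buffers, a code point written as a hex escape generally corresponds to more than one byte of UTF-8 text. The sketch below shows that correspondence for the two escapes above. `encodeUtf8` is a hypothetical helper written for this illustration and is not a BRex function; how BRex handles escapes internally is not shown here.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes (hypothetical helper).
// Returns an empty string for surrogates and values above U+10FFFF,
// which are not valid Unicode scalar values.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogate range, rejected
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Here `encodeUtf8(0x1F335)` yields the four bytes F0 9F 8C B5 (the cactus), while `encodeUtf8(0x59)` yields the single byte for Y.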

Common escapes are also supported:

/"%NUL; %n; %%; %;" %* null, newline, literal %, and a " quote*%/

Also in ranges:

/[🌵🌶]?/

The usual set of repeats and optional (but no greedy/lazy behaviors)

A simple number regex:

/[+-]? ("0" | [1-9][0-9]*)/

A Zipcode regex:

/[0-9]{5}("-"[0-9]{4})?/

A (simple) filename + short extension regex:

/[a-zA-Z0-9_]+ "." [a-zA-Z0-9]{1,3}/

Named patterns

A simple number regex with named parts (defined previously):

/[+-]? ("0" | ${NonZeroDigit}${Digit}*)/

Conjunction, Start/End Anchors, and Negation

A regex that matches a Zipcode AND that is a valid Kentucky prefix:

/${Zipcode} & ^"4"[0-2]/

A regex that matches a filename that ends with ".txt":

/${Filename} & ".txt"$/

A regex that matches a filename that does not end with ".tmp" or ".scratch":

/${Filename} & !(".tmp" | ".scratch")$/
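One way to read conjunction and negation is as a combination of independent checks on the same string: the input is accepted only if every conjunct accepts it, and a negated conjunct must fail. The sketch below emulates two of the expressions above with `std::regex` from the C++ standard library. It is only an approximation of the semantics and not how BRex evaluates them. The function names are hypothetical, and the patterns standing in for `${Zipcode}` and `${Filename}` are assumptions, since their definitions are not given here.

```cpp
#include <regex>
#include <string>

// Emulates /${Zipcode} & ^"4"[0-2]/ as two separate checks.
// Assumption: Zipcode is five digits with an optional ZIP+4 suffix.
bool isKentuckyZip(const std::string& s) {
    static const std::regex zipcode("[0-9]{5}(-[0-9]{4})?");
    static const std::regex kyPrefix("^4[0-2]");
    return std::regex_match(s, zipcode)       // whole string is a Zipcode
        && std::regex_search(s, kyPrefix);    // AND it starts with 40, 41 or 42
}

// Emulates /${Filename} & !(".tmp" | ".scratch")$/.
// Assumption: Filename is a word-character name, a dot, and an extension.
bool isKeptFile(const std::string& s) {
    static const std::regex filename("[a-zA-Z0-9_]+\\.[a-zA-Z0-9]+");
    static const std::regex tempSuffix("(\\.tmp|\\.scratch)$");
    return std::regex_match(s, filename)      // whole string is a Filename
        && !std::regex_search(s, tempSuffix); // AND it does NOT end in a temp suffix
}
```

With these definitions, `isKentuckyZip("40506")` holds and `isKentuckyZip("90210")` does not; `isKeptFile("notes.txt")` holds and `isKeptFile("notes.tmp")` does not. In a PCRE-style regex the same constraints would typically need lookaheads folded into one pattern, which is harder to read than a list of conjuncts.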

Matching Anchors

These allow us to find matches that are guarded by other expressions (which we don't want to include in the match).

For a file like mark_abc.txt we can match the abc part while making sure that it is preceded by the username prefix mark_ and is not followed by a .tmp or .scratch extension:

/"mark_"^<${FilenameFragment}>$!(".tmp" | ".scratch")/

A simple URI path (TODO)

Example Matching

The BRex API provides a range of matching/testing algorithms for various text processing scenarios. All matching algorithms are specialized (via templates) for both Unicode and ASCII processing.

Testing for a match

The simplest case is to take a string and test whether it is in the language described by the BRex expression. This is the test method.

bool test(TStr* sstr, ExecutorError& error);

In this case the string between spos and epos (the whole of sstr for the signature shown above) is matched. BRex similarly provides testFront and testBack methods, which test whether a string starts or ends with a match to the BRex expression, as well as a testContains method.
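The difference between the four testing modes can be sketched with `std::regex` as a stand-in engine. None of the functions below are the BRex implementations. They are hypothetical free functions that reuse the BRex method names only to show which part of the string each mode has to cover, and `testWhole` stands in for `test`.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Whole string must be in the language (the role of test).
bool testWhole(const std::string& s, const std::regex& re) {
    return std::regex_match(s, re);
}

// Some prefix of the string must match.
bool testFront(const std::string& s, const std::regex& re) {
    return std::regex_search(s, re, std::regex_constants::match_continuous);
}

// Some suffix of the string must match. Trying every suffix is quadratic
// in the worst case, which is fine for a sketch but not for production.
bool testBack(const std::string& s, const std::regex& re) {
    for (std::size_t i = 0; i <= s.size(); ++i) {
        auto first = s.begin() + static_cast<std::ptrdiff_t>(i);
        if (std::regex_match(first, s.end(), re)) {
            return true;
        }
    }
    return false;
}

// Some substring anywhere in the string must match.
bool testContains(const std::string& s, const std::regex& re) {
    return std::regex_search(s, re);
}
```

For the pattern `h[aeiou]+`, the string `"heex"` passes `testFront` and `testContains` but fails `testWhole` and `testBack`, while `"xhee"` passes `testBack` and `testContains` only.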

Finding a match

BRex does not provide grouping or lazy/eager matches. Instead it always finds the longest match over the full expression (TODO we also want to allow shortest). This ensures that the matching is unique and predictable. You can then pull out individual parts of the match in additional matching steps.

Thus the API for finding a match is simple:

std::optional<std::pair<int64_t, int64_t>> matchContains(TStr* sstr, ExecutorError& error);

Here the std::pair result holds the start and end positions of the FIRST match in the string, extended to the LONGEST possible match from that start.

As with test, we also provide matchFront and matchBack methods that find a match at the start or end of the string, along with a more verbose version that allows subrange matches.
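The "first match, longest extent" rule can be illustrated with a hand-written scanner for the "h followed by one or more vowels" pattern that returns the same result type as matchContains. This is a sketch only: `findFirstLongest` is a hypothetical function, it is hard-coded to one ASCII pattern, and it assumes a half-open range where the end position is one past the last matched character. The source does not say whether BRex end positions are inclusive or exclusive.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Find the first occurrence of 'h' followed by one or more vowels and
// extend it as far as possible. Returns {start, end} with end exclusive
// (an assumption made for this sketch), or nullopt if nothing matches.
std::optional<std::pair<std::int64_t, std::int64_t>>
findFirstLongest(const std::string& s) {
    auto isVowel = [](char c) {
        return std::string("aeiou").find(c) != std::string::npos;
    };
    const std::int64_t n = static_cast<std::int64_t>(s.size());
    for (std::int64_t start = 0; start < n; ++start) {
        if (s[static_cast<std::size_t>(start)] != 'h') {
            continue;  // a match can only begin at an 'h'
        }
        std::int64_t end = start + 1;
        while (end < n && isVowel(s[static_cast<std::size_t>(end)])) {
            ++end;     // greedily consume vowels: longest extent
        }
        if (end > start + 1) {
            return std::make_pair(start, end);  // earliest start wins
        }
    }
    return std::nullopt;
}
```

On the input `"oh heeey hoo"` the `h` at index 1 is skipped because no vowel follows it, and the result is the pair (3, 7) covering `heee`. The later `hoo` is never reported because only the first match is returned.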