Awesome

regexy

A Python library for parsing, compiling, and executing regular expressions. All searches execute in linear time with respect to the size of the regular expression and search text.

Initial work inspired by Thompson's NFA paper.
APIs inspired by RE2 and Python's re module.
Implementation is my own and it's not influenced by any existing solution.

Be aware! This is nothing more than an experiment for researching purposes.

Status

Compatibility

Python +3.5

Install

$ pip install regexy

Usage

Notice regexy returns all capturing groups specified within a repeated sub-expression

import regexy

regexy.match(regexy.compile(r'((a)*b)'), 'aab')
# Match<('aab', ('a', 'a'))>

regexy.match(regexy.compile(r'a'), 'b')
# None

regexy.match(regexy.compile(r'a'), 'a')
# Match<()>

Streams are supported (i.e: network and files)

Note: Capturing may take as much RAM as all of the data in worst case when the full regex is captured

import io
import regexy


def stream_gen():
    stream = io.BytesIO(b'Im a stream')
    stream_wrapper = io.TextIOWrapper(stream, encoding='utf-8', write_through=True)

    while True:
        chars = stream_wrapper.read(5)

        if not chars:
            break

        yield from chars

regexy.match(regexy.compile(r'(\w+| +)*'), stream_gen())
# (('Im', ' ', 'a', ' ', 'stream'),)

Here is a (undocumented) way to print the generated NFA for debugging purposes:

import regexy

str(regexy.compile(r'a*').state)
# ('*', [('a', [('*', [...])]), ('EOF', [])])
# The [...] thing means it's recursive

Docs

Read The Docs

Tests

$ make test

License

MIT