Awesome
regexy
A Python library for parsing, compiling, and executing regular expressions. All searches execute in linear time with respect to the size of the regular expression and search text.
Initial work inspired by Thompson's NFA paper.
APIs inspired by RE2 and Python's re module.
Implementation is my own and it's not influenced by any existing solution.
Be aware! This is nothing more than an experiment for researching purposes.
Status
- Matcher
- Basic operators:
*
,?
,+
and|
- Capturing groups
- Symbols escaping
- Shorthands:
\w
,\d
,\s
,\W
,\D
,\S
- Sets
[...]
(+ ranges and shorthands) - Repetition ranges
{n, m}
- non-capturing groups
- Greedy and non-greedy match
-
^
and$
symbols -
\b
word boundary - Match any (dot)
- Sets complement
- Lookahead assertion
(?=...)
and(?!...)
(limited to a single char) - Assertions
\A
,\z
,\B
- Named capturing groups
-
search
-
full_match
- Flags
- User friendly compiling errors
- ... ?
Compatibility
- Python +3.5
Install
$ pip install regexy
Usage
Notice regexy
returns all capturing groups specified within a repeated sub-expression
import regexy
regexy.match(regexy.compile(r'((a)*b)'), 'aab')
# Match<('aab', ('a', 'a'))>
regexy.match(regexy.compile(r'a'), 'b')
# None
regexy.match(regexy.compile(r'a'), 'a')
# Match<()>
Streams are supported (i.e: network and files)
Note: Capturing may take as much RAM as all of the data in worst case when the full regex is captured
import io
import regexy
def stream_gen():
stream = io.BytesIO(b'Im a stream')
stream_wrapper = io.TextIOWrapper(stream, encoding='utf-8', write_through=True)
while True:
chars = stream_wrapper.read(5)
if not chars:
break
yield from chars
regexy.match(regexy.compile(r'(\w+| +)*'), stream_gen())
# (('Im', ' ', 'a', ' ', 'stream'),)
Here is a (undocumented) way to print the generated NFA for debugging purposes:
import regexy
str(regexy.compile(r'a*').state)
# ('*', [('a', [('*', [...])]), ('EOF', [])])
# The [...] thing means it's recursive
Docs
Tests
$ make test
License
MIT