Awesome
foonathan/lex
Note: Replaced by foonathan/lexy.
This library is a C++14 constexpr
tokenization and (in the future) parsing library.
The tokens are specified in the type system so they are available at compile-time.
With this information a trie is constructed that efficiently matches the input.
Basic Example
The tokens for a simple calculator:
using tokens = lex::token_spec<struct variable, struct plus, struct minus, …>;
struct variable : lex::rule_token<variable, tokens>
{
static constexpr auto rule() const noexcept
{
// variables consists of one or more characters
return lex::token_rule::plus(lex::ascii::is_alpha);
}
};
struct plus : lex::literal_token<'+'>
{};
struct minus : lex::literal_token<'-'>
{};
See example/ctokenizer.cpp for an annotated example and tutorial.
Features
- Declarative token specification: No need to worry about ordering or implementing lexing by hand.
- Fast: Performance is comparable or faster to a handwritten state machine, see benchmarks.
- Lightweight: No memory allocation, tokens are just string views into the input.
- Lazy: The
lex::tokenizer
will just tokenize the next token in the input. - Fully
constexpr
: The entire lexing can happen at compile-time. - Flexible error handling: On invalid input, a
lex::error_token
is created consuming one characters. The parser can then decide how an error should be handled.
FAQ
Q: Isn't the name lex already taken?
A: It is. That's why the library is called foonathan/lex
.
In my defense, naming is hard.
I could come up with some cute name, but then its not really descriptive.
If you know foonathan/lex
, you know what the project is about.
Q: Sounds great, but what about compile-time?
A: Compiling the foonathan_lex_ctokenizer
target, which contains an implementation of a tokenizer for C (modulo some details),
takes under three seconds.
Just including <iostream>
takes about half a second, including <iostream>
and <regex>
takes about two seconds.
So the compile time is noticeable, but as a tokenizer will not be used in a lot of files of the project and rarely changes, acceptable.
In the future, I will probably look at optimizing it as well.
Q: My lex::rule_token
doesn't seem to be matched?
A: This could be due to one of two things:
- Multiple rule tokens would match the input. Then the tokenizer just picks the one that comes first.
Make sure that all rule tokens are mutually exclusive, maybe by using
lex::null_token
and creating them all in one place at necessary. Seeint_literal
andfloat_literal
in the C tokenizer for an example. - A literal token is a prefix of the rule token, e.g. a C comment
/* … */
and the/
operator are in conflict. By default, the literal token is preferred in that case. Implementis_conflicting_literal()
in your rule token as done by thecomment
token in the C tokenizer.
A mode to test for this issues is planned.
Q: The lex::tokenizer
gives me just the next token, how do I implement lookahead for specific tokens?
A: Simple call get()
until you've reached the token you want to lookahead, then reset()
the tokenizer to the earlier position.
Q: How does it compare to compile-time-regular-expressions?
A: That project implements a RegEx parser at compile-time, which can be used to match strings. foonathan/lex is project is purely designed to tokenize strings. You could implement a tokenizer with the compile-time RegEx but I have choosen a different approach.
Q: How does it compare to PEGTL?
A: That project implements matching parsing expression grammars (PEGs), which are a more powerful RegEx, basically. On top of that they've implemented a parsing interface, so you can create a parse tree, for example. foonathan/lex currently does just tokenization, but I plan on adding parse rules on top of the tokens later on. Complex tokens in foonathan/lex can be described using PEG as well, but here the PEGs are described using operator overloading and functions, and in PEGTL they are described by the type system.
Q: It breaks when I do this!
A: Don't do that. And file an issue (or a PR, I have a lot of other projects...).
Q: This is awesome!
A: Thanks. I do have a Patreon page, so consider checking it out:
Documentation
Tutorial and reference documentation can be found here.
Compiler Support
The library requires a C++14 compiler with reasonable constexpr
support.
Compilers that are being tested on CI:
- Linux:
- GCC 5 to 8, but compile-time parsing is not supported for GCC < 8 (still works at runtime)
- clang 4 to 7
- MacOS:
- XCode 9 and 10
- Windows:
- Visual Studio 2017, but compile-time parsing sometimes doesn't work (still works at runtime)
Installation
The library is header-only and requires my debug_assert library as well as the (header-only and standalone) Boost.mp11.
Using CMake add_subdirectory()
:
Download and call add_subdirectory()
.
It will look for the dependencies with find_package()
, if they're not found, the git submodules will be used.
Then link to foonathan::foonathan_lex
.
Using CMake find_package()
:
Download and install, setting the CMake variable FOONATHAN_LEX_FORCE_FIND_PACKAGE=ON
.
This requires the dependencies to be installed as well.
Then call find_package(foonathan_lex)
and link to foonathan::foonathan_lex
.
With other buildsystems:
You need to set the following options:
- Enable C++14
- Add the include path, so
#include <debug_assert.hpp>
works - Add the include path, so
#include <boost/mp11/mp11.hpp>
works - Add the include path, so
#include <foonathan/lex/tokenizer.hpp>
works
Planned Features
- Parser on top of the tokenizer
- Integrated way to handle data associated with tokens (like the value of an integer literal)
- Optimize compile-time