ECMAScript Regular Expression Language Features
This proposal seeks to investigate and introduce new features to the ECMAScript RegExp
object, based on features commonly available in other languages.
Status
Stage: 0
Champion: Ron Buckton (@rbuckton)
For detailed status of this proposal see TODO, below.
Authors
- Ron Buckton (@rbuckton)
Motivations
ECMAScript regular expressions have slowly improved over the years to adopt new functionality commonly present in other languages, including:
- Unicode Support
- Named Capture Groups
- Match Indices
However, a large majority of other languages and libraries share a common set of features that ECMAScript regular expressions currently lack. Some of these features improve performance in degenerate cases, such as excessive backtracking in complex patterns. Others introduce new tools for developers to write more powerful regular expressions.
As a result, ECMAScript developers wishing to leverage these capabilities are left with few options: relying on native bindings to third-party libraries in environments such as Node.js, or falling back to server-side evaluation.
There are numerous applications for extending the ECMAScript regular expression feature set, including:
- In-browser support for TextMate grammars for web based editors/IDEs.
- Improved performance for expressions through possessive quantifiers and backtracking control.
- RegExp-based parsers that can support balanced brackets/parens.
- Documenting complex patterns in the pattern itself.
- Improved readability through the use of multi-line patterns and insignificant whitespace.
Syntax
This proposal seeks to investigate multiple additions to the ECMAScript regular expression syntax based on features commonly available in other languages and engines. This work is based on the research at https://rbuckton.github.io/regexp-features/, which is an ongoing effort to document the commonalities and differences of various features in popular regular expression engines. This proposal does not seek to implement all of the proposed syntax, but to investigate each feature to determine its applicability to ECMAScript. Where possible, we will indicate whether the syntax described should be considered definitive (i.e., the specific syntax is not subject to change should the feature be adopted), or proposed (i.e., the specific syntax is open for debate).
<dfn id="definitive-syntax">Definitive syntax</dfn> is that which is generally-consistent with all engines that implement the functionality, such that a change to the syntax would have a net-negative effect when considering compatibility with other engines (such as would be the case with TextMate grammars, patterns commonly used in documentation to describe a valid input, etc.).
<dfn id="proposed-syntax">Proposed syntax</dfn> is that which is inconsistent between the various engines that implement similar functionality, such that a change to the syntax to fit ECMAScript requirements would not likely be a compatibility concern.
Flags
Explicit capture mode (n)
Status: Definitive
Prior Art: Perl, PCRE, .NET (feature comparison)
The explicit capture mode (n) flag affects capturing behavior, such that normal capture groups (i.e., ()) are treated as non-capturing groups. Only named capture groups are returned.
NOTE: The n-mode flag can be used inside of a Modifier.
API
- RegExp.prototype.explicitCapture (Boolean): Indicates whether the n-mode flag is set.
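The effect of the n flag can be approximated in today's syntax by writing every unnamed group as a non-capturing group. The following is a minimal sketch of that equivalence; the date-like pattern and the variable names are illustrative assumptions, not part of the proposal.

```javascript
// Today's approximation of a hypothetical /(\d{4})-(?<month>\d{2})/n:
// the unnamed group is written as (?:...) so only the named group captures.
const re = /(?:\d{4})-(?<month>\d{2})/;
const match = re.exec("2024-05");

console.log(match.length);       // 2 (the full match plus one capture)
console.log(match.groups.month); // "05"
```

With the proposed flag, the author would not need to rewrite each group by hand, which matters most for long patterns where grouping is used purely for precedence.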
Extended mode (x)
Status: Definitive
Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)
The extended mode (x) flag treats unescaped whitespace characters as insignificant, allowing for multi-line regular expressions. It also enables Line Comments.
NOTE: The x-mode flag can be used inside of a Modifier.
NOTE: While the x-mode flag can be used in a RegularExpressionLiteral, it does not permit the use of LineTerminator in a RegularExpressionLiteral. For multi-line regular expressions you would need to use the RegExp constructor.
NOTE: Perl's original x-mode treated whitespace as insignificant anywhere within a pattern except for within character classes. Perl v5.26 introduced the xx flag, which also ignores non-escaped SPACE and TAB characters within character classes. Should we choose to adopt the x-mode flag, we could opt to treat it as Perl's xx mode at the outset.
API
- RegExp.prototype.extended (Boolean): Indicates whether the x-mode flag is set.
Modifiers
Status: Definitive
Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)
Modifiers allow you to change the currently active RegExp flags within a subexpression.
- (?imnsux-imnsux): Sets or unsets (using -) the specified RegExp flags starting at the current position until the next closing ) or the end of the pattern.
- (?imnsux-imnsux:subexpression): Sets or unsets (using -) the specified RegExp flags for the subexpression.
NOTE: Certain flags cannot be modified mid-expression. These currently include g (global), y (sticky), and d (hasIndices).
NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
Example
const re1 = /^(?i)[a-z](?-i)[a-z]$/;
re1.test("ab"); // true
re1.test("Ab"); // true
re1.test("aB"); // false
const re2 = /^(?i:[a-z](?-i:[a-z]))$/;
re2.test("ab"); // true
re2.test("Ab"); // true
re2.test("aB"); // false
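For comparison, a scoped case-insensitive modifier can be roughly approximated without any new syntax by expanding letters into character classes. The sketch below assumes a hypothetical helper, caseInsensitiveLiteral, that only handles ASCII letters and expects a literal containing no regular expression metacharacters.

```javascript
// Hypothetical helper: expands each ASCII letter of a literal into a
// two-letter character class, approximating (?i:literal) without the i flag.
function caseInsensitiveLiteral(literal) {
  return literal.replace(
    /[a-zA-Z]/g,
    (ch) => `[${ch.toLowerCase()}${ch.toUpperCase()}]`
  );
}

// Approximates /^(?i:select) [a-z]+$/: the keyword ignores case,
// while the name that follows remains case-sensitive.
const re = new RegExp(`^${caseInsensitiveLiteral("select")} [a-z]+$`);

console.log(re.source);               // ^[sS][eE][lL][eE][cC][tT] [a-z]+$
console.log(re.test("SELECT users")); // true
console.log(re.test("Select users")); // true
console.log(re.test("select USERS")); // false
```

The expansion quickly becomes unreadable and does not cover Unicode case folding, which is part of the motivation for supporting modifiers directly.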
Comments
Status: Definitive
Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)
A comment is a sequence of characters that is ignored by pattern matching and can be used to document a pattern.
- (?#comment): The entire expression is removed from the pattern. The text of comment may not contain other ( or ) characters.
NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
Example
const re = /foo(?#comment)bar/;
re.test("foobar"); // true
Line Comments
Status: Definitive
Prior Art: Perl, PCRE, .NET, ICU, Glib/GRegex (feature comparison)
A Line Comment is a sequence of characters starting with # and ending with \n (or the end of the pattern) that is ignored by pattern matching and can be used to document a pattern.
- # comment: A line comment in a multi-line RegExp.
NOTE: Requires the x-mode flag.
NOTE: Inside of x-mode, a literal # character must be escaped (using \#) outside of a character class.
Example
const re = new RegExp(String.raw`
# match ASCII alpha-numerics
[a-zA-Z0-9]
`, "x");
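Since the x flag is not part of ECMAScript today, the behavior described above can only be approximated by preprocessing the pattern source. The following sketch assumes a hypothetical helper named extended; it removes unescaped whitespace and line comments outside character classes, and it is not a complete implementation of any engine's x-mode.

```javascript
// Hypothetical helper approximating the proposed x flag: removes unescaped
// whitespace and "# ..." line comments that appear outside character classes.
function extended(source, flags = "") {
  let out = "";
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      // Copy an escape sequence (backslash plus next character) untouched.
      out += ch + (source[i + 1] ?? "");
      i++;
    } else if (inClass) {
      // Inside [...] everything is significant, including spaces and #.
      if (ch === "]") inClass = false;
      out += ch;
    } else if (ch === "[") {
      inClass = true;
      out += ch;
    } else if (ch === "#") {
      // Skip the comment up to (not including) the next newline.
      while (i + 1 < source.length && source[i + 1] !== "\n") i++;
    } else if (!/\s/.test(ch)) {
      out += ch;
    }
  }
  return new RegExp(out, flags);
}

const re = extended(String.raw`
  ^
  # match ASCII alpha-numerics
  [a-zA-Z0-9]+
  $
`);

console.log(re.source);          // ^[a-zA-Z0-9]+$
console.log(re.test("abc123"));  // true
console.log(re.test("abc 123")); // false
```

A preprocessing step like this loses the original layout in re.source and cannot interact with modifiers, which a native flag would handle.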
Buffer Boundaries
Status: Definitive
Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)
Buffer boundaries are similar to the ^ and $ anchors, except that they are not affected by the m (multiline) flag:
- \A: Matches the start of the input.
- \z: Matches the end of the input.
- \Z: A zero-width assertion consisting of an optional newline at the end of the buffer. Equivalent to (?=\n?\z).
NOTE: Requires the u flag, as \A, \z, and \Z are currently just escapes for A, z, and Z without the u flag.
NOTE: Not supported inside of a character class.
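The three boundaries can be approximated today with lookaround assertions that test whether any character precedes or follows the current position. This is a sketch under the assumption that lookbehind is available in the host engine; the constant names are illustrative.

```javascript
// Approximations in today's syntax (constant names are illustrative):
const START_OF_INPUT = String.raw`(?<![\s\S])`;             // \A
const END_OF_INPUT = String.raw`(?![\s\S])`;                // \z
const END_OR_FINAL_NEWLINE = String.raw`(?=\n?(?![\s\S]))`; // \Z

const text = "first\nsecond\n";

// With the m flag, ^ matches at the start of every line...
const everyLine = text.match(/^\w+/gm);
console.log(everyLine); // ["first", "second"]

// ...but the \A approximation still matches only at the very start.
const firstOnly = text.match(new RegExp(`${START_OF_INPUT}\\w+`, "gm"));
console.log(firstOnly); // ["first"]

// The \Z approximation tolerates one trailing newline; \z does not.
const endsBeforeNewline = new RegExp(`second${END_OR_FINAL_NEWLINE}`, "m").test(text);
const endsAtVeryEnd = new RegExp(`second${END_OF_INPUT}`, "m").test(text);
console.log(endsBeforeNewline); // true
console.log(endsAtVeryEnd);     // false
```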
Line Endings Escape
Status: Definitive
Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, ICU, Glib/GRegex (feature comparison)
- \R: Matches any line ending character sequence per UTS#18. Equivalent to: (?>\r\n?|[\x0A-\x0C\x85\u{2028}\u{2029}]) (see Atomic Groups)
NOTE: Requires the u flag, as \R is currently just an escape for R without the u flag.
NOTE: Not supported inside of a character class.
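Without atomic groups, the closest approximation in current syntax is an ordinary alternation that lists the CRLF pair first. The sketch below uses an illustrative constant name and splits a string on each kind of line ending.

```javascript
// Approximates the proposed \R with an ordinary alternation.
// "\r\n" is listed first so a CRLF pair is consumed as one line ending.
const LINE_ENDING = /\r\n|[\n\v\f\r\x85\u2028\u2029]/;

const lines = "one\r\ntwo\nthree\u2028four\rfive".split(LINE_ENDING);
console.log(lines); // ["one", "two", "three", "four", "five"]
```

Unlike the atomic form shown above, this alternation can still be split by backtracking when it is embedded in a larger pattern (for example, matching only the \r of a CRLF pair), so it is an approximation rather than an exact equivalent.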
Possessive Quantifiers
Status: Definitive
Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, ICU, Glib/GRegex (feature comparison)
Possessive quantifiers are like normal (a.k.a. "greedy") quantifiers, but do not backtrack if the rest of the pattern to the right fails to match. Possessive quantifiers are often used as a performance tweak to avoid expensive backtracking in a complex pattern.
- *+: Match zero or more instances of the preceding atom without backtracking.
- ++: Match one or more instances of the preceding atom without backtracking.
- ?+: Match zero or one instances of the preceding atom without backtracking.
- {n,}+: Where n is an integer. Matches the preceding atom at least n times without backtracking.
- {n,m}+: Where n and m are integers, and m >= n. Matches the preceding atom at least n times and at most m times without backtracking.
NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
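The difference between greedy and possessive matching can be illustrated in current syntax with a well-known workaround: capture the greedy run inside a lookahead, then consume it with a backreference. This is a sketch of that workaround, not the proposed syntax, and the variable names are illustrative.

```javascript
// Greedy: \d+ gives back one digit so that the trailing \d can match.
const greedy = /^\d+\d$/;
console.log(greedy.test("12345")); // true

// Approximation of the proposed /^\d++\d$/: the lookahead captures the longest
// run of digits and \1 consumes exactly that run. A lookahead is not re-entered
// on backtracking, so the run is never shortened and the trailing \d fails.
const possessiveLike = /^(?=(\d+))\1\d$/;
console.log(possessiveLike.test("12345")); // false

// The same construct still matches when the next atom cannot overlap the run.
const digitsThenLetter = /^(?=(\d+))\1[a-z]$/;
console.log(digitsThenLetter.test("123a")); // true
```

The workaround costs an extra capture group and obscures intent, which a dedicated quantifier suffix would avoid.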
Atomic Groups
Status: Definitive
Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, ICU, Glib/GRegex (feature comparison)
An Atomic Group is a non-backtracking expression that is matched independently of neighboring patterns, and will not backtrack in the event of a failed match. This is often used to improve performance.
- (?>pattern): Matches the provided pattern, but once it has matched, no backtracking into the group is performed if the rest of the pattern fails.
NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
Example
// NOTE: x-mode flag used to illustrate difference
// without atomic groups:
const re1 = /\(( [^()]+ | \([^()]*\))+ \)/x;
re1.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // can take several seconds to fail
// with atomic groups
const re2 = /\(( (?> [^()]+ ) | \([^()]*\))+ \)/x;
re2.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // significantly faster as less backtracking is involved
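The second pattern above can be approximated in current engines with the lookahead-plus-backreference workaround, since a lookahead is not re-entered once it has succeeded. The sketch below is an assumption-laden rewrite of that example without the x flag; the variable name is illustrative.

```javascript
// Approximation of /\((?>[^()]+|\([^()]*\))+\)/ in today's syntax:
// (?=([^()]+))\1 captures a run of non-parenthesis characters and consumes
// it as a unit, so the run cannot be re-split during backtracking.
const balancedOneLevel = /\((?:(?=([^()]+))\1|\([^()]*\))+\)/;

console.log(balancedOneLevel.test("(a(b)c)")); // true
console.log(balancedOneLevel.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa")); // false
```

Note that the workaround introduces an extra capture group, which shifts the numbering of any later groups; a native atomic group would not.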
Conditional Expressions
Status: Definitive/Proposed (depending on condition, see below)
Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Glib/GRegex (feature comparison)
A Conditional Expression checks a condition and evaluates its first alternative if the condition is true; otherwise, it evaluates its second alternative.
- (?(condition)yes-pattern|no-pattern): Matches yes-pattern if condition is true; otherwise, matches no-pattern.
- (?(condition)yes-pattern): Matches yes-pattern if condition is true; otherwise, matches the empty string.
NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
Conditions
The following conditions are proposed:
- (?=test-pattern): Evaluates to true if a positive lookahead for test-pattern matches; otherwise, evaluates to false.
  - Status: Definitive
- (?<=test-pattern): Evaluates to true if a positive lookbehind for test-pattern matches; otherwise, evaluates to false.
  - Status: Proposed (backtracking not supported in all engines)
- (?!test-pattern): Evaluates to true if a negative lookahead for test-pattern matches; otherwise, evaluates to false.
  - Status: Definitive
- (?<!test-pattern): Evaluates to true if a negative lookbehind for test-pattern matches; otherwise, evaluates to false.
  - Status: Proposed (backtracking not supported in all engines)
- (n): Evaluates to true if the capture group at offset n was successfully matched; otherwise, evaluates to false.
  - Status: Definitive
- (<name>): Evaluates to true if the named capture group with the provided name was successfully matched; otherwise, evaluates to false.
  - Status: Definitive
- ('name'): Evaluates to true if the named capture group with the provided name was successfully matched; otherwise, evaluates to false.
  - Status: Definitive
- (R): Evaluates to true if inside a recursive expression; otherwise, evaluates to false.
- (Rn): Evaluates to true if inside a recursive expression for the capture group at offset n; otherwise, evaluates to false.
- (R&name): Evaluates to true if inside a recursive expression for the named capture group with the provided name; otherwise, evaluates to false.
- (DEFINE): Always evaluates to false. This allows you to define Subroutines.
  - Status: Proposed (subroutines not supported in all engines)
Example
// conditional using lookahead:
const re1 = /^(?(?=\{)\{[0-9a-f]+\}|[0-9a-f]{4})$/;
re1.test("0000"); // true
re1.test("{0}"); // true
re1.test("{00000000}"); // true
// match optional brackets (anchored, and content excludes brackets, so an
// unclosed "[" cannot be absorbed into the content):
const re2 = /^(?<openBracket>\[)?(?<content>[^\[\]]+)(?(<openBracket>)\])$/;
re2.test("abc"); // true
re2.test("[abc]"); // true
re2.test("[abc"); // false
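Both conditionals above can be approximated in current syntax by turning the condition into guarded alternatives. The following sketch shows one possible rewrite of each; the variable names are illustrative, and the second pattern drops the named groups for brevity.

```javascript
// Approximation of /^(?(?=\{)\{[0-9a-f]+\}|[0-9a-f]{4})$/:
// each alternative is guarded by the lookahead or by its negation.
const hexEscape = /^(?:(?=\{)\{[0-9a-f]+\}|(?!\{)[0-9a-f]{4})$/;

console.log(hexEscape.test("0000"));       // true
console.log(hexEscape.test("{0}"));        // true
console.log(hexEscape.test("{00000000}")); // true
console.log(hexEscape.test("00000000"));   // false

// Approximation of the optional-brackets conditional: the "group matched"
// condition is replaced by spelling out the bracketed and bare forms.
const optionalBrackets = /^(?:\[[^\[\]]+\]|[^\[\]]+)$/;

console.log(optionalBrackets.test("abc"));   // true
console.log(optionalBrackets.test("[abc]")); // true
console.log(optionalBrackets.test("[abc"));  // false
```

The rewrite duplicates the shared content sub-pattern in each branch, which is the repetition a conditional expression is meant to remove.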
Subroutines
Status: Proposed (some engines use differing syntax)
Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, Glib/GRegex (feature comparison)
A Subroutine is a pre-defined capture group or named capture group that can be reused in multiple places within the pattern to re-evaluate the subexpression from the referenced group.
- (?n): Where n is an integer >= 1. Evaluates the capture group whose offset is n.
- (?-n): Where n is an integer >= 1. Evaluates the capture group whose offset is the n<sup>th</sup> capture group declared to the left of the current atom.
- (?+n): Where n is an integer >= 1. Evaluates the capture group whose offset is the n<sup>th</sup> capture group declared to the right of the current atom.
- (?&name): Evaluates the named capture group with the provided name.
NOTE: Subroutines also allow Recursion.
NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
Example
const iso8601DateRegExp = new RegExp(String.raw`
(?(DEFINE)
(?<Year>\d{4}|[+-]\d{5,})
(?<Month>0[1-9]|1[0-2])
(?<Day>0[1-9]|[12][0-9]|3[01])
)
(?<Date>(?&Year)-(?&Month)-(?&Day)|(?&Year)(?&Month)(?&Day))
`, "x");
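In current ECMAScript, the usual substitute for subroutines is to compose the pattern source from string fragments. The sketch below mirrors the example above under that approach; it adds ^ and $ anchors for the demonstration, and the fragment names are ordinary JavaScript constants rather than capture groups.

```javascript
// Fragments that play the role of the groups inside (?(DEFINE)...).
const Year = String.raw`(?:\d{4}|[+-]\d{5,})`;
const Month = String.raw`(?:0[1-9]|1[0-2])`;
const Day = String.raw`(?:0[1-9]|[12][0-9]|3[01])`;

// Interpolating a fragment stands in for a (?&Name) subroutine call.
const iso8601DateRegExp = new RegExp(
  `^(?<Date>${Year}-${Month}-${Day}|${Year}${Month}${Day})$`
);

console.log(iso8601DateRegExp.test("2024-05-17")); // true
console.log(iso8601DateRegExp.test("20240517"));   // true
console.log(iso8601DateRegExp.test("2024-13-17")); // false
```

String composition works for simple reuse but cannot express recursion, and the resulting source repeats every fragment in full.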
Recursion
Status: Proposed (some engines use differing syntax)
Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, Glib/GRegex (feature comparison)
A Recursive Expression provides a mechanism for re-evaluating a capture group inside of itself, to handle cases such as matching balanced parentheses or brackets.
- (?R), (?0): Reevaluates the entire pattern starting at the current position.
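With recursion, balanced parentheses could be written along the lines of \((?:[^()]|(?R))*\). Current ECMAScript has no equivalent, but the recursion can be unrolled by hand to a fixed depth. The sketch below assumes a hypothetical helper, balancedParens, whose maxDepth parameter is the number of nesting levels allowed inside the outermost pair.

```javascript
// Hypothetical helper: builds a pattern for balanced parentheses by expanding
// the "recursive" reference by hand, up to maxDepth nested levels inside the
// outermost pair.
function balancedParens(maxDepth) {
  let inner = String.raw`[^()]*`; // innermost level: no parentheses allowed
  for (let depth = 0; depth < maxDepth; depth++) {
    inner = String.raw`(?:[^()]|\(${inner}\))*`;
  }
  return new RegExp(String.raw`^\(${inner}\)$`);
}

const re = balancedParens(2);

console.log(re.test("(a(b(c))d)"));   // true  (two nested levels)
console.log(re.test("(a(b(c(d))))")); // false (three nested levels)
console.log(re.test("(a(b)"));        // false (unbalanced)
```

The unrolled pattern grows with each level and still imposes an arbitrary depth limit, which true recursion would remove.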
NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
TODO
The following is a high-level list of tasks to progress through each stage of the TC39 proposal process:
Stage 1 Entrance Criteria
- Identified a "champion" who will advance the addition.
- Prose outlining the problem or need and the general shape of a solution.
- Illustrative examples of usage.
- High-level API.
Stage 2 Entrance Criteria
- Initial specification text.
- Transpiler support (Optional).
Stage 3 Entrance Criteria
- Complete specification text.
- Designated reviewers have signed off on the current spec text.
- The ECMAScript editor has signed off on the current spec text.
Stage 4 Entrance Criteria
- Test262 acceptance tests have been written for mainline usage scenarios and merged.
- Two compatible implementations which pass the acceptance tests.
- A pull request has been sent to tc39/ecma262 with the integrated spec text.
- The ECMAScript editor has signed off on the pull request.