Home

Awesome

<!--#region:intro-->

Regular Expression Atomic Operators for ECMAScript

This proposal seeks to introduce syntax to ECMAScript regular expressions to control backtracking in certain scenarios by treating certain portions of a pattern as "atomic", where backtracking information specific to that portion of the pattern is discarded when it matches successfully.

For example, currently the pattern a(bc|b)c will match both "abcc" and "abc". In the case of "abcc", we match the term a, then the alternative bc, and finally the term c. In the case of "abc", we match the term a, then the alternative bc (consuming the rest of the input), but fail to match the final term c. Since the rest of the match failed, we backtrack to the beginning of the disjunction bc|b and attempt the alternative. We then match the alternative b, and finally the term c.

In some cases such backtracking is unwanted or is potentially catastrophic. To address such cases, we propose the addition of Atomic Groups ((?> Disjunction )) and Possessive Quantifiers (n*+, n++, etc.). In an Atomic Group we discard backtracking information for the group when it matches successfully. If the rest of the pattern following the atomic group fails to match, we would no longer backtrack to the beginning of the group in an attempt to process any alternatives.

In the case of a(?>bc|b)c, we would still match "abcc" but would no longer match "abc". In the case of "abc", we match the term a, then the alternative bc, but fail to match the term c. Since backtracking information for the disjunction bc|b was discarded upon the successful match of bc, we cannot attempt the alternative b. As a result, the match for "abc" fails.

It is worth noting that lookaround operators (i.e., (?=), (?!), (?<=), and (?<!)) are already atomic operations.

<!--#endregion:intro--> <!--#region:status-->

Status

Stage: 1
Champion: Ron Buckton (@rbuckton)

For detailed status of this proposal see TODO, below.

<!--#endregion:status--> <!--#region:authors-->

Authors

<!--#endregion:authors--> <!--#region:motivations-->

Motivations

NOTE: See https://github.com/rbuckton/proposal-regexp-features for an overview of how this proposal fits into other possible future features for Regular Expressions.

This proposal seeks to introduce new syntax to allow users fine control the performance characteristics of regular expressions through possessive quantifiers and backtracking control.

<!--#endregion:motivations--> <!--#region:prior-art-->

Prior Art

See https://rbuckton.github.io/regexp-features/features/possessive-quantifiers.html and https://rbuckton.github.io/regexp-features/features/non-backtracking-expressions.html for additional information.

<!--#endregion:prior-art--> <!--#region:syntax-->

Syntax

Atomic Groups

An Atomic Group is a non-backtracking expression which is matched independent of neighboring patterns, and will not backtrack in the event of a failed match. This is often used to improve performance.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Possessive Quantifiers

Possessive quantifiers are like normal (a.k.a. "greedy") quantifiers, but do not backtrack if the rest of the pattern to the right fails to match. Possessive quantifiers are often used as a performance tweak to avoid catastrophic backtracking.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Possessive quantifiers are syntactic sugar over (?> Disjunction ):

<!--#endregion:syntax--> <!--#region:semantics--> <!-- # Semantics --> <!--#endregion:semantics--> <!--#region:examples-->

Examples

// NOTE: x-mode flag used to illustrate difference
// without atomic groups:
const re1 = /\((      [^()]+   | \([^()]*\))+ \)/x;
re1.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // can take several seconds to fail

// with atomic groups
const re2 = /\((  (?> [^()]+ ) | \([^()]*\))+ \)/x;
//                ^^^--------^ atomic group
re2.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // significantly faster as less backtracking is involved

// with posessive quantifiers
const re2 = /\((      [^()]++  | \([^()]*\))+ \)/x;
//                         ^^- possessive quantifier
re2.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // significantly faster as less backtracking is involved

See https://github.com/rbuckton/proposal-regexp-x-mode for more information on the x flag.

<!--#endregion:examples--> <!--#region:api--> <!-- # API --> <!--#endregion:api--> <!--#region:grammar--> <!-- # Grammar ```grammarkdown ``` --> <!--#endregion:grammar--> <!--#region:references--> <!-- # References > TODO: Provide links to other specifications, etc. * [Title](url) --> <!--#endregion:references-->

History

<!--#region:todo-->

TODO

The following is a high-level list of tasks to progress through each stage of the TC39 proposal process:

Stage 1 Entrance Criteria

Stage 2 Entrance Criteria

Stage 3 Entrance Criteria

Stage 4 Entrance Criteria

<!--#endregion:todo--> <!-- The following links are used throughout the README: -->