<!--#region:intro-->

ECMAScript Regular Expression Language Features

This proposal seeks to investigate and introduce new features to the ECMAScript RegExp object, based on features commonly available in other languages.

<!--#endregion:intro--> <!--#region:status-->

Status

Stage: 0
Champion: Ron Buckton (@rbuckton)

For detailed status of this proposal see TODO, below.

<!--#endregion:status--> <!--#region:authors-->

Authors

<!--#endregion:authors--> <!--#region:motivations-->

Motivations

ECMAScript regular expressions have slowly improved over the years to adopt new functionality commonly present in other languages, including:

However, a large majority of other languages and libraries share a common set of features that ECMAScript regular expressions currently lack. Some of these features improve performance in degenerate cases, such as excessive backtracking in complex patterns. Others give developers new tools for writing more powerful regular expressions.

As a result, ECMAScript developers wishing to leverage these capabilities are left with few options: relying on native bindings to third-party libraries in environments such as Node.js, or falling back to server-side evaluation.

There are numerous applications for extending the ECMAScript regular expression feature set, including:

<!--#endregion:motivations--> <!--#region:prior-art--> <!-- # Prior Art - Language: [Feature](#todo) --> <!--#endregion:prior-art--> <!--#region:syntax-->

Syntax

This proposal seeks to investigate multiple additions to the ECMAScript regular expression syntax based on features commonly available in other languages and engines. This work is based on the research at https://rbuckton.github.io/regexp-features/, which is an ongoing effort to document the commonalities and differences of various features in popular regular expression engines. This proposal does not seek to implement all of the proposed syntax, but to investigate each feature to determine its applicability to ECMAScript. Where possible, we will indicate whether the syntax described should be considered definitive (i.e., the specific syntax is not subject to change should the feature be adopted), or proposed (i.e., the specific syntax is open for debate).

<dfn id="definitive-syntax">Definitive syntax</dfn> is that which is generally consistent across all engines that implement the functionality, such that a change to the syntax would have a net-negative effect when considering compatibility with other engines (such as would be the case with TextMate grammars, patterns commonly used in documentation to describe a valid input, etc.).

<dfn id="proposed-syntax">Proposed syntax</dfn> is that which is inconsistent between the various engines that implement similar functionality, such that a change to the syntax to fit ECMAScript requirements would not likely be a compatibility concern.

Flags

Explicit capture mode (n)

Status: Definitive

Prior Art: Perl, PCRE, .NET (feature comparison)

The explicit capture mode (n) flag affects capturing behavior, such that normal capture groups (i.e., ()) are treated as non-capturing groups. Only named capture groups are returned.

NOTE: The n-mode flag can be used inside of a Modifier.
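The effect of the flag can be approximated today by writing every unnamed group as a non-capturing group. The following is a minimal sketch of that workaround, not of the proposed flag itself; the variable names are illustrative only.

```javascript
// Without explicit capture: the unnamed group occupies a numbered slot.
const mixedCaptures = /(\d{4})-(?<month>\d{2})/;

// Approximation of n-mode: the unnamed group is rewritten as (?:...),
// so only the named group is captured.
const namedOnly = /(?:\d{4})-(?<month>\d{2})/;

const mixedMatch = mixedCaptures.exec("2024-05");
const namedMatch = namedOnly.exec("2024-05");

console.log(mixedMatch.length); // 3 (whole match, "2024", "05")
console.log(namedMatch.length); // 2 (whole match, "05")
console.log(namedMatch.groups.month); // "05"
```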

API

Extended mode (x)

Status: Definitive

Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)

The extended mode (x) flag treats unescaped whitespace characters as insignificant, allowing for multi-line regular expressions. It also enables Line Comments.

NOTE: The x-mode flag can be used inside of a Modifier.

NOTE: While the x-mode flag can be used in a RegularExpressionLiteral, it does not permit the use of LineTerminator in RegularExpressionLiteral. For multi-line regular expressions you would need to use the RegExp constructor.

NOTE: Perl's original x-mode treated whitespace as insignificant anywhere within a pattern except within character classes. Perl v5.26 introduced the xx flag, which also ignores non-escaped SPACE and TAB characters within character classes. Should we choose to adopt the x-mode flag, we could opt to treat it as Perl's xx mode from the outset.
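Until such a flag exists, similar behavior can be approximated by preprocessing the pattern source. The sketch below uses a hypothetical helper, `compileExtended`, which is not part of this proposal or of any library. It assumes Perl's original x-mode rules: unescaped whitespace and `#` line comments are dropped outside of character classes, while escapes and character class contents are kept as written.

```javascript
// Hypothetical helper: rewrites an x-mode style pattern into a plain pattern.
function compileExtended(source, flags = "") {
  let out = "";
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      // Keep escapes (including escaped whitespace and \#) untouched.
      out += ch + (source[i + 1] ?? "");
      i++;
    } else if (inClass) {
      // Inside [...] everything is significant.
      if (ch === "]") inClass = false;
      out += ch;
    } else if (ch === "[") {
      inClass = true;
      out += ch;
    } else if (ch === "#") {
      // Line comment: skip to the end of the line (or of the pattern).
      while (i + 1 < source.length && source[i + 1] !== "\n") i++;
    } else if (!/\s/.test(ch)) {
      out += ch;
    }
  }
  return new RegExp(out, flags);
}

const extended = compileExtended(String.raw`
    ^ [a-zA-Z0-9]+   # one or more ASCII alphanumerics
    \ x $            # an escaped space followed by x
`);

console.log(extended.test("abc x")); // true
console.log(extended.test("abc"));   // false
```

The helper does not understand the u or v flags' stricter escaping rules, so an escaped space would need to be written differently there.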

API

Modifiers

Status: Definitive

Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)

Modifiers allow you to change the currently active RegExp flags within a subexpression.

NOTE: Certain flags cannot be modified mid-expression. These currently include g (global), y (sticky), and d (hasIndices).

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Example

const re1 = /^(?i)[a-z](?-i)[a-z]$/;
re1.test("ab"); // true
re1.test("Ab"); // true
re1.test("aB"); // false

const re2 = /^(?i:[a-z](?-i:[a-z]))$/;
re2.test("ab"); // true
re2.test("Ab"); // true
re2.test("aB"); // false
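Where modifier syntax is not available, the same pattern can be approximated by spelling out the case variations by hand. This is only a sketch of a workaround for this particular example; it does not scale to patterns where case folding is more involved.

```javascript
// Approximates /^(?i)[a-z](?-i)[a-z]$/ without modifiers:
// the first class is widened manually, the second stays case-sensitive.
const manualCase = /^[a-zA-Z][a-z]$/;

console.log(manualCase.test("ab")); // true
console.log(manualCase.test("Ab")); // true
console.log(manualCase.test("aB")); // false
```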

Comments

Status: Definitive

Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)

A comment is a sequence of characters that is ignored by pattern matching and can be used to document a pattern.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Example

const re = /foo(?#comment)bar/;
re.test("foobar"); // true
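Without inline comments, one common workaround is to assemble the pattern from string fragments and document each fragment with an ordinary JavaScript comment. A minimal sketch, with illustrative names:

```javascript
// Each fragment carries its documentation as a normal source comment.
const commented = new RegExp([
  "foo", // literal prefix
  "bar", // literal suffix
].join(""));

console.log(commented.test("foobar")); // true
```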

Line Comments

Status: Definitive

Prior Art: Perl, PCRE, .NET, ICU, Glib/GRegex (feature comparison)

A Line Comment is a sequence of characters starting with # and ending with \n (or the end of the pattern) that is ignored by pattern matching and can be used to document a pattern.

NOTE: Requires the x-mode flag.

NOTE: Inside of x-mode, the # character must be escaped (using \#) outside of a character class.

Example

const re = new RegExp(String.raw`
    # match ASCII alpha-numerics
    [a-zA-Z0-9]
`, "x");

Buffer Boundaries

Status: Definitive

Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Hyperscan, ICU, Glib/GRegex (feature comparison)

Buffer boundaries are similar to the ^ and $ anchors, except that they are not affected by the m (multiline) flag:

NOTE: Requires the u flag, as \A, \z, and \Z are currently just escapes for A, z and Z without the u flag.

NOTE: Not supported inside of a character class.
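The sketch below approximates buffer boundaries with lookarounds that work today. It assumes the escapes carry their Perl meanings (`\A` for start of input, `\z` for end of input, `\Z` for end of input or just before a final newline); the constant names are illustrative only.

```javascript
// Lookaround stand-ins that ignore the m flag.
const START = String.raw`(?<![\s\S])`;                  // like \A
const END = String.raw`(?![\s\S])`;                     // like \z
const END_OR_FINAL_NEWLINE = String.raw`(?=\n?(?![\s\S]))`; // like \Z

const text = "one\ntwo\n";

// With the m flag, ^ and $ match at every line.
console.log(/^two$/m.test(text)); // true

// The stand-ins only match at the edges of the whole input.
console.log(new RegExp(`${START}two${END}`, "m").test(text)); // false
console.log(new RegExp(`${START}one`, "m").test(text)); // true
console.log(new RegExp(`two${END_OR_FINAL_NEWLINE}`, "m").test(text)); // true
console.log(new RegExp(`one${END_OR_FINAL_NEWLINE}`, "m").test(text)); // false
```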

Line Endings Escape

Status: Definitive

Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, ICU, Glib/GRegex (feature comparison)

NOTE: Requires the u flag, as \R is currently just an escape for R without the u flag.

NOTE: Not supported inside of a character class.
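As a rough illustration, and assuming `\R` follows the Perl and PCRE convention of matching `\r\n` as a unit or any single line-ending character, the escape can be approximated with an explicit alternation:

```javascript
// Approximation of \R: try the two-character sequence first,
// then fall back to any single line-ending character.
const lineEnding = /(?:\r\n|[\n\v\f\r\x85\u2028\u2029])/;

const lines = "a\r\nb\nc\u2028d".split(lineEnding);
console.log(lines); // ["a", "b", "c", "d"]
```

Unlike the engines listed above, this alternation is not atomic, so in a larger pattern the engine may backtrack and match only the `\r` of a `\r\n` pair.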

Possessive Quantifiers

Status: Definitive

Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, ICU, Glib/GRegex (feature comparison)

Possessive quantifiers are like normal (a.k.a. "greedy") quantifiers, but do not backtrack if the rest of the pattern to the right fails to match. Possessive quantifiers are often used as a performance tweak to avoid expensive backtracking in a complex pattern.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
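The difference can be observed today with a well-known workaround: a lookahead captures the greedy run, and a backreference then consumes it as a single unit, because ECMAScript does not backtrack into a lookahead once it has succeeded. The sketch below assumes the Perl-style spelling `\w++` for the possessive form it imitates.

```javascript
// Greedy: \w+ first takes "abc1", then gives back "1" so that \d can match.
const greedy = /^\w+\d/;

// Possessive-like (imitating \w++\d): the run captured inside the lookahead
// is consumed whole by \1 and is never shortened afterwards.
const possessiveLike = /^(?=(\w+))\1\d/;

console.log(greedy.test("abc1"));         // true
console.log(possessiveLike.test("abc1")); // false
```

The workaround costs an extra capture group, which a native possessive quantifier would not need.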

Atomic Groups

Status: Definitive

Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, ICU, Glib/GRegex (feature comparison)

An Atomic Group is a non-backtracking expression which is matched independent of neighboring patterns, and will not backtrack in the event of a failed match. This is often used to improve performance.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Example

// NOTE: x-mode flag used to illustrate difference
// without atomic groups:
const re1 = /\((      [^()]+   | \([^()]*\))+ \)/x;
re1.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // can take several seconds to fail

// with atomic groups
const re2 = /\((  (?> [^()]+ ) | \([^()]*\))+ \)/x;
re2.test("((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"); // significantly faster as less backtracking is involved
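Atomic behavior affects which strings match, not only how fast a match fails. The sketch below imitates `(?>a|ab)` with a lookahead plus backreference, a workaround that relies on lookaheads not being re-entered on backtracking; it is an approximation, not the proposed syntax.

```javascript
// Ordinary group: after "a" fails to be followed by "c",
// the engine backtracks and tries the "ab" alternative.
const backtracking = /^(?:a|ab)c$/;

// Atomic-like group (imitating (?>a|ab)): once "a" is chosen inside the
// lookahead, the "ab" alternative is never tried.
const atomicLike = /^(?=(a|ab))\1c$/;

console.log(backtracking.test("abc")); // true
console.log(atomicLike.test("abc"));   // false
console.log(atomicLike.test("ac"));    // true
```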

Conditional Expressions

Status: Definitive/Proposed (depending on condition, see below)

Prior Art: Perl, PCRE, Boost.Regex, .NET, Oniguruma, Glib/GRegex (feature comparison)

A Conditional Expression checks a condition and evaluates its first alternative if the condition is true; otherwise, it evaluates its second alternative.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Conditions

The following conditions are proposed:

Example

// conditional using lookahead:
const re1 = /^(?(?=\{)\{[0-9a-f]+\}|[0-9a-f]{4})$/;
re1.test("0000"); // true
re1.test("{0}"); // true
re1.test("{00000000}"); // true

// match optional brackets (anchored, so that an unclosed "[" fails)
const re2 = /^(?<open_bracket>\[)?(?<content>[^\[\]]+)(?(<open_bracket>)\])$/;
re2.test("abc"); // true
re2.test("[abc]"); // true
re2.test("[abc"); // false
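A lookahead condition can be approximated in current ECMAScript by guarding each alternative with a positive or negative copy of the condition. The following sketch rewrites the lookahead example that way; group-based conditions have no equally direct equivalent.

```javascript
// Imitates /^(?(?=\{)\{[0-9a-f]+\}|[0-9a-f]{4})$/:
// the "then" branch is guarded by (?=\{), the "else" branch by (?!\{).
const guarded = /^(?:(?=\{)\{[0-9a-f]+\}|(?!\{)[0-9a-f]{4})$/;

console.log(guarded.test("0000"));       // true
console.log(guarded.test("{0}"));        // true
console.log(guarded.test("{00000000}")); // true
console.log(guarded.test("{0"));         // false
console.log(guarded.test("00"));         // false
```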

Subroutines

Status: Proposed (some engines use differing syntax)

Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, Glib/GRegex (feature comparison)

A Subroutine is a pre-defined capture group or named capture group that can be reused in multiple places within the pattern to re-evaluate the subexpression from the referenced group.

NOTE: Subroutines also allow Recursion.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.

Example

const iso8601DateRegExp = new RegExp(String.raw`
  (?(DEFINE)
    (?<Year>\d{4}|[+-]\d{5,})
    (?<Month>0[1-9]|1[0-2])
    (?<Day>0[1-9]|[12][0-9]|3[01])
  )
  (?<Date>(?&Year)-(?&Month)-(?&Day)|(?&Year)(?&Month)(?&Day))
`, "x");
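For comparison, reuse of subexpressions is usually achieved today by composing the pattern from string fragments. The sketch below is one such composition of the same date pattern; the fragment names mirror the example, and the `^`/`$` anchors are an addition made so the results are easy to check.

```javascript
// Each fragment plays the role of a (DEFINE)d group.
const Year = String.raw`(?:\d{4}|[+-]\d{5,})`;
const Month = String.raw`(?:0[1-9]|1[0-2])`;
const Day = String.raw`(?:0[1-9]|[12][0-9]|3[01])`;

// Interpolating a fragment stands in for a (?&Name) subroutine call.
const iso8601Date = new RegExp(
  `^(?<Date>${Year}-${Month}-${Day}|${Year}${Month}${Day})$`
);

console.log(iso8601Date.test("2024-05-17")); // true
console.log(iso8601Date.test("20240517"));   // true
console.log(iso8601Date.test("2024-13-01")); // false
```

Unlike a subroutine, string composition duplicates the fragment text in the final pattern and cannot express self-reference.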

Recursion

Status: Proposed (some engines use differing syntax)

Prior Art: Perl, PCRE, Boost.Regex, Oniguruma, Glib/GRegex (feature comparison)

A Recursive Expression provides a mechanism for re-evaluating a capture group inside of itself, to handle cases such as matching balanced parentheses or brackets.

NOTE: This has no conflicts with existing syntax, as ECMAScript currently produces an error for this syntax in both u and non-u modes.
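Without recursion, balanced delimiters can only be matched up to a fixed nesting depth, typically by generating the pattern. The sketch below uses a hypothetical helper, `balancedParens`, to show that limitation; it is a workaround, not an implementation of the proposed feature.

```javascript
// Hypothetical helper: builds a pattern that accepts parentheses nested
// at most `depth` levels deep by expanding the "recursion" by hand.
function balancedParens(depth) {
  let inner = String.raw`[^()]*`; // depth 0: no parentheses at all
  for (let i = 0; i < depth; i++) {
    // Each level allows plain characters or one more level of nesting.
    inner = String.raw`(?:[^()]|\(${inner}\))*`;
  }
  return new RegExp(`^${inner}$`);
}

console.log(balancedParens(2).test("a(b(c)d)e")); // true
console.log(balancedParens(2).test("(((x)))"));   // false (nested too deeply)
console.log(balancedParens(3).test("(((x)))"));   // true
console.log(balancedParens(3).test("(()"));       // false (unbalanced)
```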

<!--#endregion:syntax--> <!--#region:semantics--> <!-- # Semantics --> <!--#endregion:semantics--> <!--#region:api--> <!-- # API --> <!--#endregion:api--> <!--#region:examples--> <!-- # Examples --> <!--#endregion:examples--> <!--#region:grammar--> <!-- # Grammar > TODO: Provide the grammar for the proposal. Please use [grammarkdown][Grammarkdown] syntax in > fenced code blocks as grammarkdown is the grammar format used by ecmarkup. ```grammarkdown ``` --> <!--#endregion:grammar--> <!--#region:references-->

References

<!--#endregion:references--> <!--#region:prior-discussion--> <!-- # Prior Discussion > TODO: Provide links to prior discussion topics on https://esdiscuss.org. * [Subject](https://esdiscuss.org) --> <!--#endregion:prior-discussion--> <!--#region:todo-->

TODO

The following is a high-level list of tasks to progress through each stage of the TC39 proposal process:

Stage 1 Entrance Criteria

Stage 2 Entrance Criteria

Stage 3 Entrance Criteria

Stage 4 Entrance Criteria

<!--#endregion:todo-->