Awesome
Well-formed JSON.stringify
A proposal to prevent JSON.stringify
from returning ill-formed Unicode strings.
Status
This proposal is at stage 4 of the TC39 Process.
Champions
- Mathias Bynens
Motivation
RFC 8259 section 8.1 requires JSON text exchanged outside the scope of a closed ecosystem to be encoded using UTF-8, but JSON.stringify
can return strings including code points that have no representation in UTF-8 (specifically, surrogate code points U+D800 through U+DFFF).
And contrary to the description of JSON.stringify
, such strings are not "in UTF-16" because "isolated UTF-16 code units in the range D800₁₆..DFFF₁₆ are ill-formed" per The Unicode Standard, Version 10.0.0, Section 3.4 at definition D91 and excluded from being "in UTF-16" per definition D89.
However, returning such invalid Unicode strings is unnecessary, because JSON strings can include Unicode escape sequences.
Proposed Solution
Rather than return unpaired surrogate code points as single UTF-16 code units, represent them with JSON escape sequences.
Illustrative examples
// Non-BMP characters still serialize to surrogate pairs.
JSON.stringify('𝌆')
// → '"𝌆"'
JSON.stringify('\uD834\uDF06')
// → '"𝌆"'
// Unpaired surrogate code units will serialize to escape sequences.
JSON.stringify('\uDF06\uD834')
// → '"\\udf06\\ud834"'
JSON.stringify('\uDEAD')
// → '"\\udead"'
Discussion
Backwards Compatibility
This change is backwards-compatible, under an assumption of consumer compliance with the JSON specification.
User-visible effects will be limited to the replacement of some rare single UTF-16 code units in JSON.stringify
output with equivalent six-character escape sequences that can be represented both in UTF-16 and in UTF-8.
It is the authors' opinion that any consumer accepting the current ill-formed output will be unaffected by this change (this is true in particular of ECMAScript JSON.parse
).
Any consumer rejecting the current ill-formed output will have a new opportunity to accept its well-formed representation, although such consumers may still reject input that specifies strings including Unicode code points that are not scalar values (e.g., because they only accept I-JSON input), but those that accept it must have mechanisms for dealing with unpaired surrogates (as mentioned in the specification of JSON).
Validity
Unicode escape sequences are valid JSON, and—being completely ASCII—are well-formed in both UTF-16 and UTF-8.
Specification
The specification is available in ecmarkup or rendered HTML.
Implementations
- V8, enabled by default in V8 v7.2.10 and Chrome 72
- SpiderMonkey, shipping in Firefox 64
- JavaScriptCore, shipping in Safari Technology Preview 73