STFU-8: Sorta Text Format in UTF-8

STFU-8 is a hacky text encoding/decoding protocol for data that might not be quite UTF-8 but is still mostly UTF-8. It is based on the syntax of the repr created when you write (or print) binary text in Rust, Python, C, or other common programming languages.

Its primary purpose is to let a human visualize and edit "data" that is mostly (or fully) visible UTF-8 text. It encodes all non-visible or non-UTF-8-compliant bytes as longform text (i.e. ESC becomes the full string r"\x1B"). It can also encode/decode ill-formed UTF-16.
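To make the idea concrete, here is a minimal sketch of that kind of encoding, written against the standard library only. It is not the stfu8 crate's implementation: `encode_sketch` and `push_hex` are hypothetical names, and the sketch assumes that only ASCII control characters count as "non visible" and that a literal backslash is written as `\\`.

```rust
use std::fmt::Write;

/// Append one raw byte in longform, e.g. 0x1B becomes `\x1B`.
fn push_hex(out: &mut String, byte: u8) {
    // Writing into a String cannot fail.
    write!(out, "\\x{:02X}", byte).unwrap();
}

/// Illustrative encoder (hypothetical, not the crate's API).
/// Visible UTF-8 passes through; everything else becomes `\xXX`.
fn encode_sketch(mut input: &[u8]) -> String {
    let mut out = String::new();
    while !input.is_empty() {
        // Split off the longest valid UTF-8 prefix.
        let (valid, rest) = match std::str::from_utf8(input) {
            Ok(s) => (s, &input[input.len()..]),
            Err(e) => {
                let (good, bad) = input.split_at(e.valid_up_to());
                (std::str::from_utf8(good).unwrap(), bad)
            }
        };
        for c in valid.chars() {
            match c {
                // Assumption: a literal backslash is doubled.
                '\\' => out.push_str("\\\\"),
                // Assumption: only ASCII control characters are "non visible".
                c if c.is_ascii_control() => push_hex(&mut out, c as u8),
                c => out.push(c),
            }
        }
        // The first byte of any remainder is not valid UTF-8 at this
        // position: escape it and keep scanning after it.
        if let Some((&byte, tail)) = rest.split_first() {
            push_hex(&mut out, byte);
            input = tail;
        } else {
            input = rest;
        }
    }
    out
}
```

With this sketch, the bytes `b"hi\x1b[0m"` come out as the text `hi\x1B[0m`, and a stray `0xFF` byte between `a` and `b` comes out as `a\xFFb`.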

Comparison to other formats:

Specification

In simple terms, encoded STFU-8 is itself always valid Unicode which decodes to binary (the binary is not necessarily UTF-8). It differs from plain Unicode text in that single \ items are illegal. The following patterns are legal:
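The "single \ is illegal" rule can be illustrated with a small decoder sketch. This is not the stfu8 crate's decoder: `decode_sketch` is a hypothetical name, and the sketch assumes just two escape patterns, `\\` for a literal backslash and `\xXX` for one raw byte, rather than the full set of legal patterns.

```rust
/// Illustrative decoder (hypothetical, not the crate's API).
/// Assumes only two escapes: `\\` and `\xXX`.
fn decode_sketch(text: &str) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    let mut chars = text.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Ordinary characters decode to their own UTF-8 bytes.
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            continue;
        }
        match chars.next() {
            Some('\\') => out.push(b'\\'),
            Some('x') => {
                let hi = chars.next().and_then(|d| d.to_digit(16));
                let lo = chars.next().and_then(|d| d.to_digit(16));
                match (hi, lo) {
                    // Two hex digits always fit in one byte.
                    (Some(h), Some(l)) => out.push((h * 16 + l) as u8),
                    _ => return Err("`\\x` needs two hex digits".to_string()),
                }
            }
            // A backslash followed by anything else (or by nothing)
            // is a single `\`, which is rejected.
            _ => return Err("single `\\` is illegal".to_string()),
        }
    }
    Ok(out)
}
```

Under these assumptions the text `hi\x1B[0m` decodes to the bytes `68 69 1B 5B 30 6D`, while `a\b` or a trailing backslash is rejected with an error. The output is plain bytes, so it does not have to be valid UTF-8.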

stfu8 provides 2 different categories of functions for encoding/decoding data that are not necessarily interoperable (don't decode output created from encode_u8 with decode_u16).

There are some general rules for encoding and decoding:

tab, newline, and line-feed characters are "visible", so encoding with them in "pretty form" is optional.

UTF-16 Ill Formed Text

The problem is succinctly stated here:

http://unicode.org/faq/utf_bom.html

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]

Also, from the WTF-8 spec:

As a result, [unpaired] surrogates do occur in practice and need to be preserved. For example:

In ECMAScript (a.k.a. JavaScript), a String value is defined as a sequence of 16-bit integers that usually represents UTF-16 text but may or may not be well-formed. Windows applications normally use UTF-16, but the file system treats path and file names as an opaque sequence of WCHARs (16-bit code units).

We say that strings in these systems are encoded in potentially ill-formed UTF-16 or WTF-16.

Basically: you can't (always) convert from UTF-16 to UTF-8 and it's a real bummer. WTF-8, while kind of an answer to this problem, doesn't allow me to serialize UTF-16 into a UTF-8 format, send it to my webapp, edit it (as a human), and send it back. That is what STFU-8 is for.
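The failure mode, and one way around it, can be sketched with the Rust standard library alone. `String::from_utf16` rejects an unpaired surrogate outright, while `std::char::decode_utf16` reports each one individually so it can be written out as visible text. The function name `utf16_to_text_sketch` and the six-digit `\u{XXXXXX}` escape form used here are assumptions for illustration, not necessarily what the stfu8 crate emits.

```rust
/// Illustrative only (hypothetical helper, not the crate's API):
/// render possibly ill-formed UTF-16 as valid, human-editable text.
fn utf16_to_text_sketch(units: &[u16]) -> String {
    let mut out = String::new();
    for item in std::char::decode_utf16(units.iter().copied()) {
        match item {
            // Assumption: a literal backslash is doubled so that
            // escapes stay unambiguous.
            Ok('\\') => out.push_str("\\\\"),
            Ok(c) => out.push(c),
            // An unpaired surrogate has no UTF-8 form, so write it
            // as an escape instead of failing the whole conversion.
            Err(e) => {
                out.push_str(&format!("\\u{{{:06X}}}", e.unpaired_surrogate()));
            }
        }
    }
    out
}
```

For the code units `[0x0061, 0xD800, 0x0062]`, `String::from_utf16` returns an error, whereas this sketch yields the text `a\u{00D800}b`. A well-formed surrogate pair such as `[0xD83D, 0xDE00]` still decodes to a single character.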

LICENSE

The source code in this repository is licensed under either of

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

The STFU-8 protocol/specification itself (including the name) is licensed under Creative Commons CC0, and anyone should be able to reimplement or change it for any purpose without need of attribution. However, using the same name for a completely different protocol would probably confuse people, so please don't do it.