chardetng

crates.io docs.rs Apache 2 / MIT dual-licensed

A character encoding detector for legacy Web content.

Licensing

Please see the file named COPYRIGHT.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and motivation of the crate.

Purpose

The purpose of this detector is user retention for Firefox by ensuring that the long tail of the legacy Web is not more convenient to use in Chrome than in Firefox. (Chrome deployed ced, which left Firefox less convenient to use until the deployment of this detector.)

About the Name

chardet was the name of Mozilla's old encoding detector. I named this one chardetng, because this is the next generation of encoding detector in Firefox. There is no code reuse from the old chardet.

Optimization Goals

This crate aims to be more accurate than ICU, more complete than chardet, more explainable and modifiable than compact_enc_det (aka. ced), and, in an application that already depends on encoding_rs for other reasons, smaller in added binary footprint than compact_enc_det.

Rayon support

Enabling the optional feature multithreading makes chardetng run the detectors for individual encodings in parallel. Unfortunately, performance doesn't scale linearly with the number of CPU cores, but wall-clock time is still better than in single-threaded mode when a single instance of chardetng is running. In terms of combined CPU core usage, the multithreading mode is quite a bit worse than the single-threaded mode, so if you can find a parallelization point at some higher-level task such that multiple instances of chardetng run in parallel, each on a single thread, you'll get better results doing that.
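As a rough illustration of parallelizing at a higher level, the sketch below splits a batch of documents across a few worker threads, with each worker running a single-threaded detector over its share. The function name `detect_all`, the `workers` parameter, and the `detect` callback are hypothetical and not part of chardetng; in real use the callback would wrap a call into chardetng built without the multithreading feature.

```rust
use std::thread;

/// Runs `detect` over every document, splitting the batch across `workers`
/// threads. Each thread handles its documents one at a time, so every
/// individual detection stays single-threaded. Results keep input order.
///
/// `detect` is a hypothetical stand-in for a single-threaded detector call.
fn detect_all<F>(docs: &[Vec<u8>], workers: usize, detect: F) -> Vec<&'static str>
where
    F: Fn(&[u8]) -> &'static str + Sync,
{
    // Ceiling division; clamp to at least 1 because `chunks(0)` panics.
    let workers = workers.max(1);
    let chunk_len = ((docs.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn one scoped thread per chunk of documents.
        let handles: Vec<_> = docs
            .chunks(chunk_len)
            .map(|chunk| {
                let detect = &detect;
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|doc| detect(doc.as_slice()))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("detector thread panicked"))
            .collect()
    })
}
```

Splitting into a fixed number of chunks keeps the thread count bounded regardless of batch size. A work-stealing pool would balance uneven document sizes better, at the cost of an extra dependency.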

no_std support

chardetng works in a no_std environment that does not have an allocator.

Principle of Operation

In general, chardetng prefers negative matching (ruling out possibilities from the set of plausible encodings) over positive matching. Since negative matching alone is insufficient, there is positive matching, too.
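The interplay between the two kinds of matching can be illustrated with a deliberately tiny toy. This is not chardetng's code or scoring: the `Candidate` enum, the functions `rule_out`, `score`, and `guess`, and the scoring rules are all hypothetical and cover only three encodings. Candidates are first eliminated when the bytes cannot be valid in them, and positive evidence is only used to choose among the survivors.

```rust
/// Toy set of candidates (hypothetical; the real detector covers many more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Candidate {
    Utf8,
    Iso2022Jp,
    Windows1252,
}

/// Negative matching: keep only the candidates that `bytes` could still be.
fn rule_out(bytes: &[u8], allow_utf8: bool) -> Vec<Candidate> {
    let mut survivors = Vec::new();
    // UTF-8 is considered only when explicitly permitted, and any malformed
    // sequence rules it out.
    if allow_utf8 && std::str::from_utf8(bytes).is_ok() {
        survivors.push(Candidate::Utf8);
    }
    // ISO-2022-JP is a 7-bit encoding, so any byte >= 0x80 rules it out.
    if bytes.iter().all(|b| b.is_ascii()) {
        survivors.push(Candidate::Iso2022Jp);
    }
    // In this toy, byte validity never rules out windows-1252, which is why
    // negative matching alone cannot settle the answer.
    survivors.push(Candidate::Windows1252);
    survivors
}

/// Positive matching: a crude score for evidence in favor of a survivor.
fn score(candidate: Candidate, bytes: &[u8]) -> usize {
    match candidate {
        // Non-ASCII bytes that form well-formed UTF-8 are evidence for UTF-8.
        Candidate::Utf8 => bytes.iter().filter(|b| !b.is_ascii()).count(),
        // ESC bytes (0x1B) introduce ISO-2022-JP escape sequences.
        Candidate::Iso2022Jp => bytes.iter().filter(|&&b| b == 0x1B).count(),
        // The fallback gets no positive evidence in this toy.
        Candidate::Windows1252 => 0,
    }
}

/// Rules out impossible candidates, then picks the best-scoring survivor,
/// defaulting to windows-1252 when nothing scores above zero.
fn guess(bytes: &[u8], allow_utf8: bool) -> Candidate {
    let mut best = Candidate::Windows1252;
    let mut best_score = 0;
    for candidate in rule_out(bytes, allow_utf8) {
        let s = score(candidate, bytes);
        if s > best_score {
            best = candidate;
            best_score = s;
        }
    }
    best
}
```

For plain ASCII input, every permitted candidate survives elimination and none gains positive evidence, so the toy returns its windows-1252 fallback.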

Notes About Encodings

<dl> <dt>UTF-8</dt> <dd>Detected only if explicitly permitted by the argument to the `guess` method. It's harmful for Web browsers to detect UTF-8 without requiring user action, such as choosing a menu item, because Web developers would start relying on the detection.</dd> <dt>UTF-16[BE|LE]</dt> <dd>Not detected: Detecting these belongs on the BOM layer.</dd> <dt>x-user-defined</dt> <dd>Not detected: This encoding is for XHR. <code>&lt;meta charset=x-user-defined></code> in HTML is not unlabeled and means windows-1252.</dd> <dt>Replacement</dt> <dd>Not detected.</dd> <dt>GB18030</dt> <dd>Detected as GBK.</dd> <dt>GBK</dt> <dt>Big5</dt> <dt>EUC-KR</dt> <dt>Shift_JIS</dt> <dt>windows-1250</dt> <dt>windows-1251</dt> <dt>windows-1252</dt> <dt>windows-1253</dt> <dt>windows-1254</dt> <dt>windows-1255</dt> <dt>windows-1256</dt> <dt>windows-1257</dt> <dt>windows-1258</dt> <dt>windows-874</dt> <dt>ISO-8859-2</dt> <dt>ISO-8859-7</dt> <dd>Detected: Historical locale-specific fallbacks.</dd> <dt>EUC-JP</dt> <dt>ISO-2022-JP</dt> <dt>KOI8-U</dt> <dt>ISO-8859-5</dt> <dt>IBM866</dt> <dd>Detected: Detected by multiple browsers past and present.</dd> <dt>KOI8-R</dt> <dd>Detected as KOI8-U. (Always guessing the U variant is less likely to corrupt non-box drawing characters.)</dd> <dt>ISO-8859-8-I</dt> <dd>Detected as windows-1255.</dd> <dt>ISO-8859-4</dt> <dd>Detected: Detected by IE and Chrome; in menu in IE and Firefox.</dd> <dt>ISO-8859-6</dt> <dd>Detected: Detected by IE and Chrome.</dd> <dt>ISO-8859-8</dt> <dd>Detected: Available in menu in IE and Firefox.</dd> <dt>ISO-8859-13</dt> <dd>Detected: Detected by Chrome. This encoding is so similar to windows-1257 that menu items for windows-1257 can be considered to accommodate this one in IE and Firefox. 
Due to the mechanics of this detector, if this wasn't included as a separate item, the windows-1257 detection wouldn't catch the cases that use curly quotes and are invalid as windows-1257.</dd> <dt>x-mac-cyrillic</dt> <dd>Not detected: Not detected by IE and Chrome. (Was previously detected by Firefox.)</dd> <dt>ISO-8859-3</dt> <dt>ISO-8859-10</dt> <dt>ISO-8859-14</dt> <dt>ISO-8859-15</dt> <dt>ISO-8859-16</dt> <dt>macintosh</dt> <dd>Not detected: These encodings have never been a locale-specific fallback in a major browser or a menu item in IE.</dd> </dl>

Known Problems

Associated tools

Roadmap

No planned improvements.

Release Notes

0.1.17

0.1.16

0.1.15

0.1.14

0.1.13

0.1.12

0.1.11

0.1.10

0.1.9

0.1.8

0.1.7

0.1.6

0.1.5

0.1.4

0.1.3

0.1.2

0.1.1

0.1.0