encoding_bench
This is a performance testing framework for understanding the performance characteristics of encoding_rs.
This framework is separate from encoding_rs itself in order to keep
nightly-only Rust features (#[bench]) and CC-licensed test data out of the
main repository.
Licensing
Please see the file named COPYRIGHT.
Building
Currently, this project is tested to build only on Ubuntu 18.04.
uconv
Before building this project with the uconv cargo feature, make an optimized
Firefox build with the patch gecko.patch (from this directory) applied
to mozilla-central as of November 2016 and Rust enabled
(ac_add_options --enable-rust in your mozconfig). This worked on Ubuntu
16.04 and doesn't appear to work without extra effort on Ubuntu 18.04 due
to changes in the C++ standard library. The patch breaks the ability to run
the resulting Firefox build normally and breaks the packaging that would
happen on Mozilla's try server. For the latter reason, you need to build locally.
Once you have the custom Gecko build available, build this project with
LIBRARY_PATH=/path-to-gecko-obj-dir/dist/sdk/lib:/path-to-gecko-obj-dir/dist/bin LD_LIBRARY_PATH=/path-to-gecko-obj-dir/dist/bin cargo bench
If the build is successful, it's a good idea to append 2> /dev/null for the
actual benchmarking runs to hide noise from Gecko.
kewb
To use the kewb cargo feature for Bob Steagall's SSE2-accelerated UTF-8
to UTF-16 converter, clone https://github.com/hsivonen/kewb, build it
and put that directory in LIBRARY_PATH when building encoding_bench.
webkit
To use the webkit cargo feature, build webkitgtk-2.22.2 with the patch
webkit.patch from this directory applied and make the resulting .so
available in the library search paths as in the uconv case above.
Selection of test data
For testing decoding, it's important to have test data that's real-world Web content in order to have a real-world interleaving of markup and non-markup.
Unicode.org's translations of the Universal Declaration of Human Rights have less markup than one would expect from Web content in general. (Also, the copyright status of the translations wasn't obvious at a glance.)
Using Google Translate to synthesize content in various languages doesn't work, because Google Translate adds its own markup, which messes up the natural interleaving of ASCII and non-ASCII in real-world Web content.
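To make "interleaving of ASCII and non-ASCII" concrete, here is a minimal sketch that splits a byte buffer into alternating ASCII and non-ASCII runs. It is not part of encoding_bench; the names Run and byte_runs are hypothetical and exist only for illustration. Comparing run-length distributions between candidate test files is one way, under these assumptions, to see how markup-heavy a sample is.

```rust
/// One maximal run of bytes that are all ASCII or all non-ASCII.
/// (Hypothetical helper type for illustration only.)
#[derive(Debug, PartialEq)]
struct Run {
    ascii: bool,
    len: usize,
}

/// Splits `bytes` into alternating ASCII / non-ASCII runs.
fn byte_runs(bytes: &[u8]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &b in bytes {
        let ascii = b.is_ascii();
        // Extend the current run if this byte is of the same kind.
        if let Some(run) = runs.last_mut() {
            if run.ascii == ascii {
                run.len += 1;
                continue;
            }
        }
        // Otherwise start a new run.
        runs.push(Run { ascii, len: 1 });
    }
    runs
}
```

For example, the UTF-8 bytes of `<p>é</p>` yield three runs: three ASCII bytes of markup, two non-ASCII bytes of text, then four ASCII bytes of markup.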
Reasons for choosing Wikipedia were:
- Wikipedia is an actual top site that's relevant to users.
- Wikipedia has content in all the languages that were relevant for testing.
- Wikipedia content is human-authored.
- Wikipedia content is suitably licensed.
The topic Mars, the planet, was chosen, because it is the most-featured topic across the different-language Wikipedias and, indeed, had non-trivial articles in all the languages needed. Trying to choose a typical-length article for each language separately wasn't feasible in the Wikidata data set.
For x-user-defined, a binary file is used instead of text, because the use case
for x-user-defined is loading binary data using XHR (in pre-ArrayBuffer code).
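As background on why binary input is the realistic workload here, the following is a minimal sketch of the x-user-defined mapping as defined in the WHATWG Encoding Standard: ASCII bytes map to themselves, and bytes 0x80 to 0xFF map to U+F780 to U+F7FF. It does not use encoding_rs; the function names are hypothetical. The second function mirrors the usual pre-ArrayBuffer idiom of masking each code unit with 0xFF to recover the original byte.

```rust
/// Decodes x-user-defined into UTF-16 code units.
/// Bytes 0x00..=0x7F map to the same code point; bytes 0x80..=0xFF
/// map to U+F780..=U+F7FF (Private Use Area).
/// (Hypothetical name; a sketch, not the encoding_rs implementation.)
fn decode_x_user_defined(bytes: &[u8]) -> Vec<u16> {
    bytes
        .iter()
        .map(|&b| if b < 0x80 { b as u16 } else { 0xF700 + b as u16 })
        .collect()
}

/// Recovers the original bytes by keeping the low 8 bits of each
/// code unit, as scripts did before ArrayBuffer was available.
fn recover_bytes(units: &[u16]) -> Vec<u8> {
    units.iter().map(|&u| (u & 0xFF) as u8).collect()
}
```

Because the mapping is one-to-one for every byte value, decoding followed by masking round-trips arbitrary binary data.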
For testing encoders, the relevant cases are URL parsing (almost always ASCII), form submission (typically mostly human-readable text) and POSTing data using XHR (UTF-16 to UTF-8 encode only). Because it was too troublesome to find real-world workloads representing XHR POSTs and because URL parsing is almost always ASCII, the form submission case is measured, even though encoding_rs explicitly optimizes that case for size rather than speed (except when encoding to UTF-8). The test data is a plain-text extract from each corresponding HTML decoder test file.