Note: This project is deprecated in favor of node-unicode-data.

Unicode test data for JavaScript

If you ever need JavaScript arrays of all Unicode symbols per category per Unicode version (for testing purposes, perhaps), or JavaScript-compatible regular expressions to match those symbols, this directory has got you covered. Generating this data is trickier than it sounds: JavaScript strings expose UTF-16 code units rather than code points, so every symbol beyond U+FFFF has to be represented as a surrogate pair, and both the symbol arrays and the regular expressions have to account for that.
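To illustrate the surrogate pair issue, here is a minimal sketch of how a code point maps to the code units JavaScript sees. The helper names `toCodeUnits` and `toSymbol` are made up for this example and are not part of the generator scripts in this repository.

```javascript
// Hypothetical helpers, for illustration only.
// Convert a Unicode code point to the UTF-16 code unit(s) JavaScript uses.
function toCodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    // Code points up to U+FFFF fit in a single code unit.
    return [codePoint];
  }
  // Code points above U+FFFF are split into a high and a low surrogate.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return [high, low];
}

// Build the JavaScript string for a code point from its code units.
function toSymbol(codePoint) {
  return String.fromCharCode.apply(null, toCodeUnits(codePoint));
}

const symbol = toSymbol(0x1D306); // U+1D306 lies beyond U+FFFF
console.log(symbol.length); // 2, because the symbol is stored as a surrogate pair
```

A single symbol with `length` 2 is why a naive character class such as `[\u0000-\uFFFF]` cannot match it as one unit.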

For example, I’ve used a variation of this data in the following test case: http://mathias.html5.org/tests/javascript/identifiers/ It dynamically creates and runs over 90k tests, based on the appropriate Unicode categories and symbols.

Generated data

Per Unicode category, a number of separate files will be created, including ${version}/${category}-symbols.js and ${version}/${category}-regex.js.

The same thing is done for scripts, blocks, and properties.

The data is currently being generated for the following Unicode versions:

I’ll update this repository (and this list) as soon as new Unicode versions are released.

How to generate the data

I’ve included the Python (v2.7.1) and Bash (v3.2.48) scripts I wrote to generate these files. I’m new to Python, so suggestions on how to improve these scripts are more than welcome!

To (re-)generate all data in this repository, run:

./generate.sh

Tests for the generated data

The generated data is fully tested by a script that verifies that, within the range of code points from 0x000000 to 0x10FFFF, only the symbols in ${version}/${category}-symbols.js are matched by the regular expression in ${version}/${category}-regex.js. This rather heavy test case (which runs over 33 million assertions) is available online.
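The idea behind that check can be sketched roughly as follows. This is not the actual test script: `symbolFor` and `findMismatches` are hypothetical names, the tiny symbol list and regular expression stand in for the contents of a generated file pair, and the regular expression is assumed to be anchored and to have no `g` flag.

```javascript
// Hypothetical helper: the JavaScript string for a given code point.
function symbolFor(codePoint) {
  if (codePoint <= 0xFFFF) {
    return String.fromCharCode(codePoint);
  }
  const offset = codePoint - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
}

// Walk every code point and collect those where the regex and the
// symbol list disagree. An empty result means the two are consistent.
// Assumes `regex` is anchored and not global (no lastIndex state).
function findMismatches(symbols, regex) {
  const expected = new Set(symbols);
  const mismatches = [];
  for (let codePoint = 0x000000; codePoint <= 0x10FFFF; codePoint++) {
    const symbol = symbolFor(codePoint);
    if (regex.test(symbol) !== expected.has(symbol)) {
      mismatches.push(codePoint);
    }
  }
  return mismatches;
}

// Stand-in data; the real symbols and regex would come from the generated files.
const sampleSymbols = ['a', 'b', '\uD834\uDF06'];
const sampleRegex = /^(?:[ab]|\uD834\uDF06)$/;
console.log(findMismatches(sampleSymbols, sampleRegex).length); // 0
```

Running one such pass per category and per Unicode version is what makes the total number of assertions grow so large.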

HTTP API

I’ve set up an HTTP API of sorts, which allows you to customize the output a little bit. This saves you from downloading, editing, and re-hosting the generated files if you just want to write some quick tests. Here’s an example:

http://mathias.html5.org/data/unicode/format?version=6.2.0&category=Ll&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
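If you want to build such a URL from code, something along these lines should work. This is only a sketch: `buildFormatUrl` is a hypothetical helper, and the parameter names used are just the ones visible in the example URL above.

```javascript
// Endpoint taken from the example URL above.
const BASE_URL = 'http://mathias.html5.org/data/unicode/format';

// Hypothetical helper: percent-encode each value and join the
// name=value pairs into a query string, in the order given.
function buildFormatUrl(params) {
  const query = Object.keys(params)
    .map((name) => `${name}=${encodeURIComponent(params[name])}`)
    .join('&');
  return `${BASE_URL}?${query}`;
}

const url = buildFormatUrl({
  version: '6.2.0',
  category: 'Ll',
  type: 'symbols',
  prepend: 'window.symbols = ', // text placed before the data
  append: ';'                   // text placed after the data
});
console.log(url);
```

With these values the result is identical to the example URL, since `encodeURIComponent` turns spaces into `%20`, `=` into `%3D`, and `;` into `%3B`.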

Available query string parameters

Credits

Thanks to:

Author

Mathias Bynens