Home

Awesome

Build Status

Fast UTF-8 validation with Range algorithm (NEON+SSE4+AVX2)

This is a brand new algorithm to leverage SIMD for fast UTF-8 string validation. Both NEON(armv8a) and SSE4 versions are implemented. AVX2 implementation contributed by ioioioio.

Four UTF-8 validation methods are compared on both x86 and Arm platforms. Benchmark result shows range base algorithm is the best solution on Arm, and achieves same performance as Lemire's approach on x86.

About the code

Benchmark result (MB/s)

Method

  1. Generate UTF-8 test buffer per test file or buffer size.
  2. Call validation sub-routines in a loop until 1G bytes are checked.
  3. Calculate speed(MB/s) of validating UTF-8 strings.

NEON(armv8a)

Test casenaivelookuplemirerangerange2
UTF-demo.txt562.25412.841198.501411.721579.85
32 bytes651.55441.70891.381003.951043.58
33 bytes660.00446.78588.771009.311048.12
129 bytes771.89402.55938.071283.771401.76
1K bytes811.92411.581188.961398.151560.23
8K bytes812.25412.741198.901412.181580.65
64K bytes817.35412.241200.201415.111583.86
1M bytes815.70411.931200.931415.651585.40

SSE4(E5-2650)

Test casenaivelookuplemirerangerange2
UTF-demo.txt753.70310.413954.743945.603986.13
32 bytes1135.76364.072890.522351.812173.02
33 bytes1161.85376.291352.952239.552041.43
129 bytes1161.22322.472742.493315.333249.35
1K bytes1310.95310.723755.883781.233874.17
8K bytes1348.32307.933860.713922.813968.93
64K bytes1301.34308.393935.153973.503983.44
1M bytes1279.78309.063923.513953.003960.49

Range algorithm analysis

Basic idea:

UTF-8 coding format

http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf, page 94

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code PointsFirst ByteSecond ByteThird ByteFourth Byte
U+0000..U+007F00..7F
U+0080..U+07FFC2..DF80..BF
U+0800..U+0FFFE0A0..BF80..BF
U+1000..U+CFFFE1..EC80..BF80..BF
U+D000..U+D7FFED80..9F80..BF
U+E000..U+FFFFEE..EF80..BF80..BF
U+10000..U+3FFFFF090..BF80..BF80..BF
U+40000..U+FFFFFF1..F380..BF80..BF80..BF
U+100000..U+10FFFFF480..8F80..BF80..BF

To summarise UTF-8 encoding:

Range table

Range table maps range index 0 ~ 15 to minimal and maximum values allowed. Our task is to observe input string, find the pattern and set correct range index for each byte, then validate input string.

IndexMinMaxByte type
0007FFirst Byte, ASCII
1,2,380BFSecond, Third, Fourth Bytes
4A0BFSecond Byte after E0
5809FSecond Byte after ED
690BFSecond Byte after F0
7808FSecond Byte after F4
8C2F4First Byte, non-ASCII
9..15(NEON)FF00Illegal: unsigned char >= 255 && unsigned char <= 0
9..15(SSE)7F80Illegal: signed char >= 127 && signed char <= -128

Calculate byte ranges (ignore special cases)

Ignoring the four special cases(E0,ED,F0,F4), how should we set range index for each byte?

To implement above operations efficiently with SIMD:

Example(assume no previous data)

InputF180808080C28080...
first_len30000100...
First Byte80000800...
Second Byte03000010...
Third Byte00200000...
Fourth Byte00010000...
Range index83210810...
Range_index = First_Byte | Second_Byte | Third_Byte | Fourth_Byte

Error handling

Overlapped non-ASCII First Byte

InputF180C290
first_len3010
First Byte8080
Second Byte0301
Third Byte0020
Fourth Byte0001
Range index83101

Adjust Second Byte range for special cases

Range index adjustment for four special cases

First ByteSecond ByteBefore adjustmentCorrect indexAdjustment
E0A0..BF242
ED80..9F253
F090..BF363
F480..8F374

Range index adjustment can be reduced to below problem:

Given 16 bytes, replace E0 with 2, ED with 3, F0 with 3, F4 with 4, others with 0.

A naive SIMD approach:

  1. Compare 16 bytes with E0, get the mask for eacy byte (FF if equal, 00 otherwise)
  2. And the mask with 2 to get adjustment for E0
  3. Repeat step 1,2 for ED,F0,F4

At least eight operations are required for naive approach.

Observing special bytes(E0,ED,F0,F4) are close to each other, we can do much better using lookup table.

NEON

NEON tbl instruction is very convenient for table lookup:

Leverage these features, we can solve the problem with as few as two operations:

SSE

SSE pshufb instruction is not as friendly as NEON tbl in this case:

We can still leverage these features to solve the problem in five operations:

Error handling

Handling remaining bytes

For remaining input less than 16 bytes, we will fallback to naive byte by byte approach to validate them, which is actually faster than SIMD processing.

Tests

It's necessary to design test cases to cover corner cases as more as possible.

Positive cases

  1. Prepare correct characters
  2. Validate correct characters
  3. Validate long strings
    • Round concatenate characters starting from first character to 1024 bytes
    • Validate 1024 bytes string
    • Shift 1 byte, validate 1025 bytes string
    • Shift 2 bytes, Validate 1026 bytes string
    • ...
    • Shift 16 bytes, validate 1040 bytes string
  4. Repeat step3, test buffer starting from second character
  5. Repeat step3, test buffer starting from third character
  6. ...

Negative cases

  1. Prepare bad characters and bad strings
    • Bad character
    • Bad character cross 16 bytes boundary
    • Bad character cross last 16 bytes and remaining bytes boundary
  2. Test long strings
    • Prepare correct long strings same as positive cases
    • Append bad characters
    • Shift one byte for each iteration
    • Validate each shift

Code breakdown

Below table shows how 16 bytes input are processed step by step. See range-neon.c for according code.

Range based UTF-8 validation algorithm