
This repository contains a tool for generating fast CRC32 implementations, together with a harness for compiling and benchmarking them.

Running `make` will generate, compile, and benchmark a handful of different implementations. Alternatively, run `make generate` and then use `./generate` to generate a particular implementation.

## Terse syntax for describing CRC32 implementations

### Step 1: Instruction set (`-i`)

| Terse syntax | gcc equivalent | Example CPUs |
|---|---|---|
| `-i neon` | `aarch64 -march=armv8-a+crypto+crc` | Apple M1, GCP Tau T2A (Ampere Altra Arm) |
| `-i neon_eor3` | `aarch64 -march=armv8.2-a+crypto+sha3` | Apple M1 |
| `-i sse` | `x86_64 -msse4.2 -mpclmul` | Any Intel or AMD CPU from the last 10 years |
| `-i avx512` | `x86_64 -mavx512f -mavx512vl` | Intel Skylake X and newer, AMD Zen 4 |
| `-i avx512_vpclmulqdq` | `x86_64 -mavx512f -mavx512vl -mvpclmulqdq` | Intel Ice Lake and newer, AMD Zen 4 |

For CRC32, the key feature required of any instruction set is a vector carryless multiplication instruction, e.g. pclmulqdq on x86_64 or pmull on aarch64. Three-way exclusive-or is also useful, e.g. vpternlogq on x86_64 or eor3 on aarch64.
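As a minimal sketch of that primitive (not the generator's literal output), a 16-byte folding step on x86_64 looks like the following; the fold constants in `k` are assumptions that depend on the polynomial and the fold distance:

```c
#include <emmintrin.h>  // SSE2
#include <wmmintrin.h>  // PCLMULQDQ

// Fold the 128-bit accumulator `acc` forward over the next 16 data bytes.
// `k` holds a pair of precomputed fold constants (one per 64-bit half);
// their values depend on the CRC polynomial and the fold distance.
static inline __m128i fold16(__m128i acc, __m128i data, __m128i k) {
    __m128i lo = _mm_clmulepi64_si128(acc, k, 0x00);  // low half  of acc
    __m128i hi = _mm_clmulepi64_si128(acc, k, 0x11);  // high half of acc
    return _mm_xor_si128(_mm_xor_si128(lo, hi), data);
}
```

This is the x86_64 counterpart of the `ldr, pmull+eor, pmull2+eor` aarch64 primitive discussed in the benchmark analysis below.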

### Step 2: Choice of polynomial (`-p`)

| Terse syntax | Polynomial | Hardware acceleration | Example applications |
|---|---|---|---|
| `-p crc32` | 0x04C11DB7 | aarch64 | Ethernet, SATA, zlib, png |
| `-p crc32c` | 0x1EDC6F41 | aarch64, x86_64 | iSCSI, Btrfs, ext4 |
| `-p crc32k` | 0x741B8CD7 | no | |
| `-p crc32q` | 0x814141AB | no | AIXM |
| `-p 0x87654321` | 0x87654321 | no | none |

Most new applications should probably use `-p crc32c`. A large number of existing applications use `-p crc32`. If you really care about this, then you can also specify an arbitrary polynomial.

Some instruction sets contain scalar instructions for accelerating particular CRC32 polynomials. Notably, aarch64 has scalar acceleration for `-p crc32` and `-p crc32c`, whereas x86_64 only has scalar acceleration for `-p crc32c`.
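As a minimal sketch (assuming the usual bit-reflected convention with inverted initial and final CRC values), the x86_64 scalar path builds on the SSE4.2 `crc32` instruction:

```c
#include <nmmintrin.h>  // SSE4.2: _mm_crc32_u8 / _mm_crc32_u64
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Minimal scalar CRC32C with a single accumulator. Real implementations
// run several accumulators in parallel (e.g. s3) to hide the instruction's
// latency, then merge them at the end.
uint32_t crc32c_scalar(uint32_t crc, const uint8_t* buf, size_t len) {
    crc = ~crc;
    while (len >= 8) {
        uint64_t v;
        memcpy(&v, buf, 8);
        crc = (uint32_t)_mm_crc32_u64(crc, v);
        buf += 8;
        len -= 8;
    }
    while (len--) {
        crc = _mm_crc32_u8(crc, *buf++);
    }
    return ~crc;
}
```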

### Step 3: Algorithm string (`-a`)

Six different parameters control the generated algorithm:

1. Number of vector accumulators to use (16 or 64 bytes each).
2. Number of vector loads to perform per iteration (defaults to the number of vector accumulators; must be a multiple of it).
3. Number of scalar accumulators to use (4 bytes each).
4. Number of scalar 8-byte loads to perform per iteration (defaults to the number of scalar accumulators; must be a multiple of it).
5. The outer loop size, if any.
6. Whether inner loop termination is length-based or pointer-based.

The number of vector accumulators is specified by the letter `v`, followed by a decimal number. If the number of vector loads per iteration is non-default, this is followed by the letter `x`, followed by the ratio of loads to accumulators as a decimal number. For example, `v4` denotes four vector accumulators and four vector loads per iteration, and `v3x2` denotes three vector accumulators and six vector loads per iteration.

The number of scalar accumulators is specified in a similar way, but with `s` instead of `v`. For example, `s3` denotes three scalar accumulators and three scalar loads per iteration, and `s5x3` denotes five scalar accumulators and 15 scalar loads per iteration.

An outer loop size (in bytes) can be specified by the letter `k`, followed by a decimal number. Using an outer loop slightly decreases the complexity of the inner loop (because the inner loop trip count becomes constant), but makes the implementation less well suited to input sizes that aren't a multiple of the outer loop size.

The inner loop condition can be based on the remaining length (the default), or based on pointer comparisons (specified by the letter `e`). Length-based is easier for humans to comprehend. Pointer-based can make the inner loop slightly simpler, at the expense of slightly more logic outside the loop.
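The difference is easiest to see on stub loops in plain C (CRC work elided); both forms visit exactly the same bytes:

```c
#include <stddef.h>
#include <stdint.h>

// Length-based termination: the loop condition tracks remaining bytes.
void process_length_based(const uint8_t* p, size_t n) {
    while (n >= 64) {
        /* ... process 64 bytes at p ... */
        p += 64;
        n -= 64;
    }
}

// Pointer-based termination: the end pointer is computed once, outside
// the loop, leaving a bare pointer comparison inside it.
void process_pointer_based(const uint8_t* p, size_t n) {
    const uint8_t* end = p + (n & ~(size_t)63);
    while (p != end) {
        /* ... process 64 bytes at p ... */
        p += 64;
    }
}
```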

The various parameters are concatenated to give the algorithm string. For example, `v4s3x3k4096e` denotes four vector accumulators (with four vector loads per iteration), three scalar accumulators with nine scalar loads per iteration, a 4096-byte outer loop, and pointer-based inner loop termination.

Multiple sets of parameters can be separated by an `_` character, for example `-a v4_v1` uses a `v4` algorithm for as long as possible, then switches to `v1` for any remaining bytes. If there are still any remaining bytes, then an implicit `_s1` is appended to deal with them.
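To make the grammar concrete, here is a hypothetical parser for a single parameter set, with the `_` chaining omitted; it is an illustration of the syntax above, not the repository's actual parser:

```c
#include <stdio.h>
#include <stdlib.h>

// One parameter set: v<N>[x<M>] s<N>[x<M>] k<N> e, as described above.
typedef struct {
    int v, vx;  // vector accumulators; loads-to-accumulators ratio
    int s, sx;  // scalar accumulators; loads-to-accumulators ratio
    long k;     // outer loop size in bytes (0 = no outer loop)
    int e;      // 1 = pointer-based inner loop termination
} AlgoParams;

static AlgoParams parse_algo(const char* p) {
    AlgoParams a = {0, 1, 0, 1, 0, 0};
    while (*p) {
        char c = *p++;
        long n = strtol(p, (char**)&p, 10);
        if (c == 'v' || c == 's') {
            int ratio = 1;
            if (*p == 'x') {
                p++;
                ratio = (int)strtol(p, (char**)&p, 10);
            }
            if (c == 'v') { a.v = (int)n; a.vx = ratio; }
            else          { a.s = (int)n; a.sx = ratio; }
        } else if (c == 'k') {
            a.k = n;
        } else if (c == 'e') {
            a.e = 1;
        }
    }
    return a;
}

int main(void) {
    AlgoParams a = parse_algo("v4s3x3k4096e");
    printf("vector: %d accumulators, %d loads/iteration\n", a.v, a.v * a.vx);
    printf("scalar: %d accumulators, %d loads/iteration\n", a.s, a.s * a.sx);
    printf("outer loop: %ld bytes, pointer-based: %s\n", a.k, a.e ? "yes" : "no");
    return 0;
}
```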

## Benchmark results

### Apple M1 performance (single core)

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium scalar | 25.03 GB/s | `-i neon -p crc32 -a s3k95760_s3` | 25.50 GB/s |
| Chromium vector | 26.57 GB/s | `-i neon -p crc32 -a v4_v1` | 34.03 GB/s |
| dougallj Faster CRC32 | 77.55 GB/s | `-i neon -p crc32 -a v12e_v1` | 77.69 GB/s |
| fusion-inspired | N/A | `-i neon_eor3 -p crc32 -a v9s3x2e_s3` | 85.05 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.dylib chromium_neon_crc32_v4_v1.dylib dougallj_neon_crc32_v12e_v1.dylib
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.dylib third_party/chromium_neon_crc32_v4_v1.dylib third_party/dougallj_neon_crc32_v12e_v1.dylib
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1 -i neon_eor3 -a v9s3x2e_s3
```

Rich microarchitecture details are available, though the key details for CRC32 distill down to the following table:

| Primitive | Uops | Load | Scalar | Vector | Bytes | Score | Latency |
|---|---|---|---|---|---|---|---|
| `ldr, pmull+eor, pmull2+eor` | 5/8 | 1/3 or 0.5/3 | | 2/4 | 16 | 25.6 | 6 |
| `ldr, pmull, pmull2, eor3` | 4/8 | 1/3 or 0.5/3 | | 3/4 | 16 | 21.3 | 5 |
| `ldr, pmull, pmull2, eor, eor` | 5/8 | 1/3 or 0.5/3 | | 4/4 | 16 | 16.0 | 7 |
| `ldr, crc32x` | 2/8 | 1/3 or 0.5/3 | 1/1 | | 8 | 8.0 | 3 |

Score is computed as Bytes / max(Uops, Load, Scalar, Vector), and gives the optimal bytes per cycle that we can hope to achieve with the corresponding primitive. The compiler might fuse two consecutive `ldr` instructions into a single `ldp` instruction, which reduces Load from 1/3 to 0.5/3, but does not decrease Uops. The CPU can fuse `eor` into an immediately preceding `pmull` or `pmull2`, which decreases the demand on the vector execution units, but such fused instructions still count as two for Uops.

Chromium's scalar implementation is a decent application of the `ldr, crc32x` primitive, with three scalar accumulators sufficient to match the latency of 3. Chromium's vector implementation is a poor application of the `ldr, pmull, pmull2, eor, eor` primitive, as the four vector accumulators are insufficient to match the latency of 7. Our `v4_v1` does better by using the `ldr, pmull+eor, pmull2+eor` primitive, but the four vector accumulators are still insufficient to match the latency of 6. dougallj's implementation is a good application of the `ldr, pmull+eor, pmull2+eor` primitive, using 12 vector accumulators to match the latency.

The fun twist is the `ldr, pmull, pmull2, eor3` primitive. It scores worse than `ldr, pmull+eor, pmull2+eor`, so if just one primitive is used, `ldr, pmull, pmull2, eor3` will lose out to `ldr, pmull+eor, pmull2+eor`; but it is less demanding on Uops, so a blend of `ldr, pmull, pmull2, eor3` with `ldr, crc32x` can come out ahead. The `v9s3x2e_s3` implementation blends nine copies of `ldr, pmull, pmull2, eor3` with six copies of `ldr, crc32x`, which gives Uops = 48/8, Scalar = 6/1, Vector = 27/4, Bytes = 192, and thus a score of 28.4, putting it just ahead of `ldr, pmull+eor, pmull2+eor`.
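The blend's score can be checked mechanically; Load is included for completeness (15 loads per 192-byte iteration against the 3-per-cycle limit from the Load column, so it doesn't bind):

```c
#include <stdio.h>

// Score = Bytes / max(Uops, Load, Scalar, Vector), as defined above.
static double score(double uops, double load, double scalar,
                    double vector, double bytes) {
    double m = uops;
    if (load > m)   m = load;
    if (scalar > m) m = scalar;
    if (vector > m) m = vector;
    return bytes / m;
}

int main(void) {
    // v9s3x2e_s3 blend: Uops = 48/8, 15 loads at 3/cycle, Scalar = 6/1,
    // Vector = 27/4, Bytes = 192. Vector binds at 6.75, so this prints 28.4.
    printf("%.1f\n", score(48.0 / 8, 15.0 / 3, 6.0 / 1, 27.0 / 4, 192));
    return 0;
}
```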

### GCP Tau T2A (Ampere Altra Arm) performance (single core)

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium scalar | 22.68 GB/s | `-i neon -p crc32 -a s3k95760_s3` | 23.57 GB/s |
| Chromium vector | 16.91 GB/s | `-i neon -p crc32 -a v4_v1` | 20.94 GB/s |
| dougallj Faster CRC32 | 17.41 GB/s | `-i neon -p crc32 -a v12e_v1` | 21.81 GB/s |
| fusion-inspired | N/A | `-i neon -p crc32 -a v3s4x2e_v2` | 35.87 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.so chromium_neon_crc32_v4_v1.so dougallj_neon_crc32_v12e_v1.so
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.so third_party/chromium_neon_crc32_v4_v1.so third_party/dougallj_neon_crc32_v12e_v1.so
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1,v3s4x2e_v2
```

The CPU datasheet says:

> Four-wide superscalar aggressive out-of-order execution CPU
>
> Dual full-width (128-bit wide) SIMD execution pipes

Microarchitecture measurements, however, suggest latency and throughput numbers consistent with Chromium scalar being faster than Chromium vector. The optimal fusion also involves more bytes being processed by the scalar pipes (64 bytes per iteration) than the vector pipes (48 bytes per iteration).

### x86_64 Cascade Lake performance (single core)

Relevant /proc/cpuinfo of test machine:

```
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
stepping        : 7
microcode       : 0x5003006
cpu MHz         : 2199.998
cache size      : 16384 KB
```

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 15.87 GB/s | `-i sse -p crc32 -a v4_v1` | 16.48 GB/s |
| crc32_4k_three_way | 15.04 GB/s | `-i sse -p crc32c -a s3k4096e` | 15.11 GB/s |
| crc32_4k_pclmulqdq | 15.78 GB/s | `-i sse -p crc32c -a v4k4096e` | 15.63 GB/s |
| retuned crc32_4k_pclmulqdq | N/A | `-i sse -p crc32c -a v4e` | 16.03 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | `-i avx512 -p crc32c -a v4e` | 17.36 GB/s |
| crc32_4k_fusion | 25.32 GB/s | `-i sse -p crc32c -a v4s3x3k4096e` | 25.42 GB/s |
| retuned crc32_4k_fusion | N/A | `-i sse -p crc32c -a v8s3x3` | 28.55 GB/s |
| AVX512 crc32_4k_fusion | N/A | `-i avx512 -p crc32c -a v9s3x4e` | 31.55 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,v4k4096e,v4e -i avx512 -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512 -a v9s3x4e
```

Note that Cascade Lake lacks VPCLMULQDQ, so the only useful part of AVX512 here is VPTERNLOGQ; hence the very limited performance increase from AVX512.
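For reference, this is the three-way XOR that VPTERNLOGQ provides, written with intrinsics (the 0x96 immediate is the truth table of `a ^ b ^ c`):

```c
#include <immintrin.h>

// AVX512F + AVX512VL allow VPTERNLOGQ on 16-byte registers. The 0x96
// immediate encodes the truth table of a ^ b ^ c, so two chained XORs
// collapse into a single instruction.
static inline __m128i xor3(__m128i a, __m128i b, __m128i c) {
    return _mm_ternarylogic_epi64(a, b, c, 0x96);
}
```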

### x86_64 Ice Lake performance (single core)

Relevant /proc/cpuinfo of test machine:

```
vendor_id       : GenuineIntel
cpu family      : 6
model           : 106
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
stepping        : 6
microcode       : 0xffffffff
cpu MHz         : 2600.028
cache size      : 55296 KB
```

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 24.08 GB/s | `-i sse -p crc32 -a v4_v1` | 24.02 GB/s |
| Chromium vector AVX512 | 41.00 GB/s | `-i avx512_vpclmulqdq -p crc32 -a v4_v1` | 47.31 GB/s |
| crc32_4k_three_way | 19.68 GB/s | `-i sse -p crc32c -a s3k4096e` | 19.80 GB/s |
| retuned crc32_4k_three_way | N/A | `-i sse -p crc32c -a s3` | 22.50 GB/s |
| crc32_4k_pclmulqdq | 22.80 GB/s | `-i sse -p crc32c -a v4k4096e` | 22.80 GB/s |
| retuned crc32_4k_pclmulqdq | N/A | `-i sse -p crc32c -a v4e` | 24.05 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | `-i avx512_vpclmulqdq -p crc32c -a v4x2e` | 47.50 GB/s |
| crc32_4k_fusion | 35.12 GB/s | `-i sse -p crc32c -a v4s3x3k4096e` | 34.72 GB/s |
| retuned crc32_4k_fusion | N/A | `-i sse -p crc32c -a v7s3x3` | 42.58 GB/s |
| AVX512 crc32_4k_fusion | N/A | `-i avx512_vpclmulqdq -p crc32c -a v4s5x3` | 63.98 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v7s3x3 -i avx512_vpclmulqdq -a v4s5x3
```

Ice Lake introduces VPCLMULQDQ, but VPCLMULQDQ on 64-byte registers is 1*p05+2*p5 (versus regular PCLMULQDQ on 16-byte registers being 1*p5), so going from 16-byte to 64-byte registers only gives a ~2x speedup. Chromium vector AVX512 leaves some performance on the table due to not using VPTERNLOGQ.
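As an illustration, the 64-byte analogue of the earlier 16-byte folding step looks like this (a sketch, with the polynomial-dependent fold constants again left symbolic):

```c
#include <immintrin.h>

// VPCLMULQDQ on 512-bit registers performs four independent 128-bit
// carryless multiplies, and VPTERNLOGQ (0x96 = three-way XOR) merges the
// two products with the incoming data in a single instruction.
static inline __m512i fold64(__m512i acc, __m512i data, __m512i k) {
    __m512i lo = _mm512_clmulepi64_epi128(acc, k, 0x00);
    __m512i hi = _mm512_clmulepi64_epi128(acc, k, 0x11);
    return _mm512_ternarylogic_epi64(lo, hi, data, 0x96);
}
```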

### x86_64 Sapphire Rapids performance (single core)

Relevant /proc/cpuinfo of test machine:

```
vendor_id       : GenuineIntel
cpu family      : 6
model           : 143
model name      : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
stepping        : 8
microcode       : 0xffffffff
cpu MHz         : 2699.998
cache size      : 107520 KB
```

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 23.42 GB/s | `-i sse -p crc32 -a v4_v1` | 23.31 GB/s |
| Chromium vector AVX512 | 65.20 GB/s | `-i avx512_vpclmulqdq -p crc32 -a v4_v1` | 94.00 GB/s |
| (with aligned inputs) | 81.68 GB/s | `-i avx512_vpclmulqdq -p crc32 -a v4_v1` | 94.00 GB/s |
| crc32_4k_three_way | 17.45 GB/s | `-i sse -p crc32c -a s3k4096e` | 17.59 GB/s |
| retuned crc32_4k_three_way | N/A | `-i sse -p crc32c -a s3` | 23.84 GB/s |
| crc32_4k_pclmulqdq | 22.79 GB/s | `-i sse -p crc32c -a v4k4096e` | 22.79 GB/s |
| retuned crc32_4k_pclmulqdq | N/A | `-i sse -p crc32c -a v4e` | 23.29 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | `-i avx512_vpclmulqdq -p crc32c -a v4e` | 94.84 GB/s |
| crc32_4k_fusion | 33.87 GB/s | `-i sse -p crc32c -a v4s3x3k4096e` | 34.68 GB/s |
| retuned crc32_4k_fusion | N/A | `-i sse -p crc32c -a v8s3x3` | 36.98 GB/s |
| AVX512 crc32_4k_fusion | N/A | `-i avx512_vpclmulqdq -p crc32c -a v3s1_s3` | 97.30 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512_vpclmulqdq -a v3s1_s3
```

Sapphire Rapids improves VPCLMULQDQ on 64-byte registers to 1*p5, meaning that AVX512 implementations can approach 4x the performance of implementations using 16-byte vectors (and slightly exceed 4x when also gaining VPTERNLOGQ). One catch is that we now observe a 20% performance penalty for unaligned 64-byte loads, which the Chromium code doesn't take care to handle.
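One way to sidestep that penalty (a sketch; `crc32_small` and `crc32_avx512_aligned` are hypothetical stand-ins for generated routines) is to peel off an unaligned head before entering the wide kernel:

```c
#include <stddef.h>
#include <stdint.h>

uint32_t crc32_small(uint32_t crc, const uint8_t* p, size_t n);           // hypothetical helper
uint32_t crc32_avx512_aligned(uint32_t crc, const uint8_t* p, size_t n);  // hypothetical helper

// Process up to 63 bytes with a scalar/narrow routine so that the AVX512
// kernel only ever issues 64-byte-aligned loads.
uint32_t crc32_4k(uint32_t crc, const uint8_t* p, size_t n) {
    size_t head = (size_t)(-(uintptr_t)p & 63);
    if (head > n) head = n;
    crc = crc32_small(crc, p, head);
    return crc32_avx512_aligned(crc, p + head, n - head);
}
```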

### x86_64 AMD Rome performance (single core)

Relevant /proc/cpuinfo of test machine:

```
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7B12
stepping        : 0
microcode       : 0xffffffff
cpu MHz         : 2249.998
cache size      : 512 KB
```

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 12.89 GB/s | `-i sse -p crc32 -a v4_v1` | 12.84 GB/s |
| crc32_4k_three_way | 23.52 GB/s | `-i sse -p crc32c -a s3k4096e` | 23.08 GB/s |
| crc32_4k_pclmulqdq | 12.86 GB/s | `-i sse -p crc32c -a v4k4096e` | 12.89 GB/s |
| crc32_4k_fusion | 23.84 GB/s | `-i sse -p crc32c -a v4s3x3k4096e` | 22.62 GB/s |
| retuned crc32_4k_fusion | N/A | `-i sse -p crc32c -a v1s3x3` | 31.16 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s3x3
```

### x86_64 AMD Milan performance (single core)

Relevant /proc/cpuinfo of test machine:

```
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7B13
stepping        : 0
microcode       : 0xffffffff
cpu MHz         : 2450.000
cache size      : 512 KB
```

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 12.82 GB/s | `-i sse -p crc32 -a v4_v1` | 12.85 GB/s |
| crc32_4k_three_way | 22.75 GB/s | `-i sse -p crc32c -a s3k4096e` | 23.55 GB/s |
| crc32_4k_pclmulqdq | 12.76 GB/s | `-i sse -p crc32c -a v4k4096e` | 12.86 GB/s |
| crc32_4k_fusion | 23.41 GB/s | `-i sse -p crc32c -a v4s3x3k4096e` | 22.64 GB/s |
| retuned crc32_4k_fusion | N/A | `-i sse -p crc32c -a v1s4x2` | 31.76 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s4x2
```

### x86_64 AMD Genoa performance (single core)

Relevant /proc/cpuinfo of test machine:

```
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 17
model name      : AMD EPYC 9B14
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 2599.998
cache size      : 1024 KB
```

| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 13.73 GB/s | `-i sse -p crc32 -a v4_v1` | 13.72 GB/s |
| Chromium vector AVX512 | 51.70 GB/s | `-i avx512_vpclmulqdq -p crc32 -a v4_v1` | 54.00 GB/s |
| crc32_4k_three_way | 26.27 GB/s | `-i sse -p crc32c -a s3k4096e` | 26.56 GB/s |
| retuned crc32_4k_three_way | N/A | `-i sse -p crc32c -a s3` | 27.32 GB/s |
| crc32_4k_pclmulqdq | 13.73 GB/s | `-i sse -p crc32c -a v4k4096e` | 13.74 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | `-i avx512_vpclmulqdq -p crc32c -a v3x2` | 54.77 GB/s |
| crc32_4k_fusion | 28.47 GB/s | `-i sse -p crc32c -a v4s3x3k4096e` | 28.56 GB/s |
| retuned crc32_4k_fusion | N/A | `-i sse -p crc32c -a v1s3x2` | 36.20 GB/s |
| AVX512 crc32_4k_fusion | N/A | `-i avx512_vpclmulqdq -p crc32c -a v3s2x4` | 71.95 GB/s |

To reproduce these measurements:

```
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e -i avx512_vpclmulqdq -a v3x2 -i sse -a v4s3x3k4096e,v1s3x2 -i avx512_vpclmulqdq -a v3s2x4
```
