Integer + Floating Point Compression Filter
- Fastest transpose/shuffle
- :new: (2019.11) All TurboTranspose functions are now available on 64-bit ARMv8, including NEON SIMD.
- Byte/Nibble transpose/shuffle for improving compression of binary data (ex. floating point data)
- :sparkles: Scalar/SIMD Transpose/Shuffle 8,16,32,64,... bits
- :+1: Dynamic CPU detection and JIT scalar/sse/avx2 switching
- 100% C (C++ headers), usage as simple as memcpy
- Byte Transpose
- Fastest byte transpose
- :new: (2019.11) 2D,3D,4D transpose
- Nibble Transpose
- nearly as fast as byte transpose
- more efficient, up to 10 times faster than Bitshuffle
- :new: better compression (w/ lz77) and 10 times faster than SPDP, one of the best floating-point compressors
- compresses/decompresses (w/ lz77) better and faster than other domain-specific floating-point compressors
- Scalar and SIMD Transform
- Delta encoding for sorted lists
- Zigzag encoding for unsorted lists
- Xor encoding
- :new: lossy floating point compression with user-defined error
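The delta, zigzag and xor transforms listed above can be illustrated with a minimal scalar sketch. All function names here are hypothetical and are not TurboTranspose's API; the sketch also assumes that zigzag is applied to the deltas of an unsorted list, which is how such a transform is usually combined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helpers only; the names are hypothetical, not the library's API. */

/* Delta: store each value as the difference to its predecessor. */
static void delta_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cur = in[i];
        out[i] = cur - prev;          /* small values for sorted input */
        prev = cur;
    }
}

static void delta_dec32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += in[i];
        out[i] = prev;
    }
}

/* Zigzag: fold the sign into bit 0, so 0,-1,1,-2,... maps to 0,1,2,3,... */
static uint32_t zigzag_enc32(int32_t v) {
    uint32_t u = (uint32_t)v;
    return (u << 1) ^ (0u - (u >> 31));
}

static int32_t zigzag_dec32(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1u)));
}

/* Zigzag of deltas: negative differences in unsorted input stay small. */
static void zigzag_delta_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cur = in[i];
        out[i] = zigzag_enc32((int32_t)(cur - prev));
        prev = cur;
    }
}

static void zigzag_delta_dec32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += (uint32_t)zigzag_dec32(in[i]);
        out[i] = prev;
    }
}

/* Xor: neighbouring values that share high bits leave mostly zero bits. */
static void xor_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cur = in[i];
        out[i] = cur ^ prev;
        prev = cur;
    }
}

static void xor_dec32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev ^= in[i];
        out[i] = prev;
    }
}

/* Usage: every transform must round-trip exactly. */
static void transform_demo(void) {
    uint32_t v[4] = {10, 7, 12, 11}, e[4], d[4];
    zigzag_delta_enc32(v, 4, e);
    zigzag_delta_dec32(e, 4, d);
    for (size_t i = 0; i < 4; i++) assert(d[i] == v[i]);
}
```

Each transform turns values that are close to their predecessor into small integers whose high bytes are zero, which is the kind of data a subsequent byte or nibble transpose groups into long compressible runs.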
Transpose Benchmark:
- Benchmark Intel CPU: Skylake i7-6700 3.4GHz gcc 9.2 single thread
- Benchmark ARM: ARMv8 A73-ODROID-N2 1.8GHz
- Speed test
Benchmark w/ 16k buffer
BOLD = pareto frontier.<br>
E:Encode, D:Decode<br>
./tpbench -s# file -B16K (# = 8,4,2)
E cycles/byte | D cycles/byte | Transpose 64 bits AVX2 |
---|---|---|
.199 | .134 | TurboTranspose Byte |
.326 | .201 | Blosc byteshuffle |
.394 | .260 | TurboTranspose Nibble |
.848 | .478 | Bitshuffle 8 |
E cycles/byte | D cycles/byte | Transpose 32 bits AVX2 |
---|---|---|
.121 | .102 | TurboTranspose Byte |
.451 | .139 | Blosc byteshuffle |
.345 | .229 | TurboTranspose Nibble |
.773 | .476 | Bitshuffle |
E cycles/byte | D cycles/byte | Transpose 16 bits AVX2 |
---|---|---|
.095 | .071 | TurboTranspose Byte |
.640 | .108 | Blosc byteshuffle |
.329 | .198 | TurboTranspose Nibble |
.758 | 1.177 | Bitshuffle 2 |
.067 | .067 | memcpy |
E MB/s | D MB/s | 16 bits ARM 2019.11 |
---|---|---|
8192 | 16384 | TurboTranspose Byte |
8192 | 8192 | blosc byteshuffle |
1638 | 2341 | TurboTranspose Nibble |
356 | 287 | blosc bitshuffle |
16384 | 16384 | memcpy |
E MB/s | D MB/s | 32 bits ARM 2019.11 |
---|---|---|
8192 | 8192 | TurboTranspose Byte |
8192 | 8192 | blosc byteshuffle |
1820 | 2341 | TurboTranspose Nibble |
372 | 252 | blosc bitshuffle |
E MB/s | D MB/s | 64 bits ARM 2019.11 |
---|---|---|
4096 | 8192 | TurboTranspose Byte |
5461 | 5461 | blosc byteshuffle |
1490 | 1490 | TurboTranspose Nibble |
372 | 260 | blosc bitshuffle |
Transpose/Shuffle benchmark w/ large files (100MB).
MB/s: 1,000,000 bytes/second<br>
./tpbench -s# file (# = 8,4,2)
E MB/s | D MB/s | Transpose 16 bits AVX2 2019.11 |
---|---|---|
9208 | 9795 | TurboTranspose Byte |
8382 | 7689 | Blosc byteshuffle |
9377 | 9584 | TurboTranspose Nibble |
2750 | 2530 | Blosc bitshuffle |
13725 | 13900 | memcpy |
E MB/s | D MB/s | Transpose 32 bits AVX2 2019.11 |
---|---|---|
9718 | 9713 | TurboTranspose Byte |
9181 | 9030 | Blosc byteshuffle |
8750 | 9472 | TurboTranspose Nibble |
2767 | 2942 | Blosc bitshuffle 4 |
E MB/s | D MB/s | Transpose 64 bits AVX2 2019.11 |
---|---|---|
8998 | 9573 | TurboTranspose Byte |
8721 | 8586 | Blosc byteshuffle 2 |
8252 | 9222 | TurboTranspose Nibble |
2711 | 2053 | Blosc bitshuffle 2 |
E MB/s | D MB/s | 16 bits ARM 2019.11 |
---|---|---|
872 | 3998 | TurboTranspose Byte |
678 | 3852 | blosc byteshuffle |
1365 | 2195 | TurboTranspose Nibble |
357 | 280 | blosc bitshuffle |
3921 | 3913 | memcpy |
E MB/s | D MB/s | 32 bits ARM 2019.11 |
---|---|---|
1828 | 3768 | TurboTranspose Byte |
1769 | 3713 | blosc byteshuffle |
1456 | 2299 | TurboTranspose Nibble |
374 | 243 | blosc bitshuffle |
E MB/s | D MB/s | 64 bits ARM 2019.11 |
---|---|---|
1793 | 3572 | TurboTranspose Byte |
1784 | 3544 | blosc byteshuffle |
1176 | 1267 | TurboTranspose Nibble |
331 | 203 | blosc bitshuffle |
- Compression test (transpose/shuffle+lz4)
:new: Download IcApp, a new benchmark for TurboPFor+TurboTranspose,<br>
for testing almost all integer and floating point file types.<br>
Note: the lossy compression benchmark is available with icapp only.
- Speed test (file msg_sweep3d)
C size | ratio % | C MB/s | D MB/s | Name AVX2 |
---|---|---|---|---|
11,348,554 | 18.1 | 2276 | 4425 | TurboTranspose Nibble+lz |
22,489,691 | 35.8 | 1670 | 3881 | TurboTranspose Byte+lz |
43,471,376 | 69.2 | 348 | 402 | SPDP |
44,626,407 | 71.0 | 1065 | 2101 | bitshuffle+lz |
62,865,612 | 100.0 | 13300 | 13300 | memcpy |
./tpbench -s4 -z *.sp
File | File size | lz % | Tp8lz | Tp4lz | BSlz | spdp1 | | spdp9 | Tp4lzt | eTp4lzt |
---|---|---|---|---|---|---|---|---|---|---|
msg_bt | 133194716 | 94.3 | 70.4 | 66.4 | 73.9 | 70.0 | | 67.4 | 54.7 | 32.4 |
msg_lu | 97059484 | 100.4 | 77.1 | 70.4 | 75.4 | 76.8 | | 74.0 | 61.0 | 42.2 |
msg_sppm | 139497932 | 11.7 | 11.6 | 12.6 | 15.4 | 14.4 | | 13.7 | 9.0 | 5.6 |
msg_sp | 145052928 | 100.3 | 68.8 | 63.7 | 68.1 | 67.9 | | 65.3 | 52.6 | 24.9 |
msg_sweep3d | 62865612 | 98.7 | 35.8 | 18.1 | 71.0 | 69.6 | | 13.7 | 9.8 | 3.8 |
num_brain | 70920000 | 100.4 | 76.5 | 71.1 | 77.4 | 79.1 | | 73.9 | 63.4 | 32.6 |
num_comet | 53673984 | 92.4 | 79.0 | 77.6 | 82.1 | 84.5 | | 84.6 | 70.1 | 41.7 |
num_control | 79752372 | 99.4 | 89.5 | 90.7 | 88.1 | 98.3 | | 98.5 | 81.4 | 51.2 |
num_plasma | 17544800 | 100.4 | 0.7 | 0.7 | 75.5 | 30.7 | | 2.9 | 0.3 | 0.2 |
obs_error | 31080408 | 89.2 | 73.1 | 70.0 | 76.9 | 78.3 | | 49.4 | 20.5 | 12.2 |
obs_info | 9465264 | 93.6 | 70.2 | 61.9 | 72.9 | 62.4 | | 43.8 | 27.3 | 15.1 |
obs_spitzer | 99090432 | 98.3 | 90.4 | 95.6 | 93.6 | 100.1 | | 100.7 | 80.2 | 52.3 |
obs_temp | 19967136 | 100.4 | 89.5 | 92.4 | 91.0 | 99.4 | | 100.1 | 84.0 | 55.8 |
Tp8 = Byte transpose, Tp4 = Nibble transpose, BS = Bitshuffle, lz = lz4<br />
eTp4lzt = lossy compression with lzturbo and allowed error = 0.0001 (1e-4)<br />
Slow but best compression: SPDP9 and lzt = lzturbo,39
File | File size | lz % | Tp8lz | Tp4lz | BSlz | spdp1 | | spdp9 | Tp4lzt | eTp4lzt |
---|---|---|---|---|---|---|---|---|---|---|
msg_bt | 266389432 | 94.5 | 77.2 | 76.5 | 81.6 | 77.9 | | 75.4 | 69.9 | 16.0 |
msg_lu | 194118968 | 100.4 | 82.7 | 81.0 | 83.7 | 83.3 | | 79.6 | 75.5 | 21.0 |
msg_sppm | 278995864 | 18.9 | 14.5 | 14.9 | 19.5 | 21.5 | | 19.8 | 11.2 | 2.8 |
msg_sp | 290105856 | 100.4 | 79.2 | 77.5 | 80.2 | 78.8 | | 77.1 | 71.3 | 12.4 |
msg_sweep3d | 125731224 | 98.7 | 50.7 | 36.7 | 80.4 | 76.2 | | 33.2 | 27.3 | 1.9 |
num_brain | 141840000 | 100.4 | 82.6 | 81.1 | 84.5 | 87.8 | | 83.3 | 77.0 | 16.3 |
num_comet | 107347968 | 92.8 | 83.3 | 78.8 | 76.3 | 86.5 | | 86.0 | 69.8 | 21.2 |
num_control | 159504744 | 99.6 | 92.2 | 90.9 | 89.4 | 97.6 | | 98.9 | 85.5 | 25.8 |
num_plasma | 35089600 | 75.2 | 0.7 | 0.7 | 84.5 | 77.3 | | 3.0 | 0.3 | 0.1 |
obs_error | 62160816 | 78.7 | 81.0 | 77.5 | 84.4 | 87.9 | | 62.3 | 23.4 | 6.3 |
obs_info | 18930528 | 92.3 | 75.4 | 70.6 | 82.4 | 81.7 | | 51.2 | 33.1 | 7.7 |
obs_spitzer | 198180864 | 95.4 | 93.2 | 93.7 | 86.4 | 100.1 | | 102.4 | 78.0 | 26.9 |
obs_temp | 39934272 | 100.4 | 93.1 | 93.8 | 91.7 | 98.0 | | 97.4 | 88.2 | 28.8 |
eTp4lzt = lossy compression with allowed error = 0.0001<br />
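The lossy columns above depend on a preprocessing step with a user-defined error bound, which this README does not describe. One common approach, sketched here purely as an assumption, is to clear as many low mantissa bits of each value as the bound allows. The function name `lossy_pad64` is hypothetical, and treating the allowed error as a relative error is also an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the library's API: clear the largest number of
 * low mantissa bits of x such that |result - x| <= relerr * |x|. */
static double lossy_pad64(double x, double relerr) {
    uint64_t u, best;
    double ax;
    int bits;

    memcpy(&u, &x, sizeof u);
    /* Leave zero, infinities and NaNs untouched. */
    if (x == 0.0 || ((u >> 52) & 0x7FF) == 0x7FF) return x;

    ax = x < 0 ? -x : x;
    best = u;
    for (bits = 1; bits <= 52; bits++) {
        uint64_t t = u & ~((UINT64_C(1) << bits) - 1);  /* drop low bits */
        double y, d;
        memcpy(&y, &t, sizeof y);
        d = y - x;
        if (d < 0) d = -d;
        if (d > relerr * ax) break;   /* error only grows with more bits */
        best = t;
    }
    memcpy(&x, &best, sizeof x);
    return x;
}

/* Usage: the result stays inside the requested bound. */
static void lossy_demo(void) {
    double x = 2.718281828459045, y = lossy_pad64(x, 1e-4);
    double d = y > x ? y - x : x - y;
    assert(d <= 1e-4 * x);
}
```

The cleared bits land in the low-order byte and nibble planes after transposing, so those planes become runs of zeros that an lz77 compressor removes almost entirely. That is consistent with the much smaller eTp4lzt ratios in the tables, although the method actually used may differ.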
Compile:
git clone https://github.com/powturbo/TurboTranspose.git
cd TurboTranspose
Linux + Windows MinGW
make
or
make AVX2=1
Windows Visual C++
nmake /f makefile.vs
or
nmake AVX2=1 /f makefile.vs
Testing:
- benchmark "transpose" functions <br />
./tpbench [-s#] [-z] file
-s# = element size in bytes, # = 2,4,8,16,... (default 4)
-z = only lz77 compression benchmark (bitshuffle package mandatory)
Function usage:
Byte transpose:
void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);<br>
void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);<br />
in : input buffer<br />
n : number of bytes<br />
out : output buffer<br />
esize : element size in bytes (2,4,8,...)<br />
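To make the effect of the byte transpose concrete, here is a plain scalar reference with the same parameter order as `tpenc`/`tpdec`. The names `ref_tpenc` and `ref_tpdec` are hypothetical. The output layout (byte plane 0 of every element, then plane 1, and so on) and the unchanged copy of trailing bytes are assumptions, not a description of the library's SIMD implementation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar reference, not the library's code.
 * Gathers byte b of every element into one contiguous plane. */
static void ref_tpenc(const unsigned char *in, unsigned n,
                      unsigned char *out, unsigned esize) {
    unsigned count = n / esize;   /* number of whole elements */
    unsigned i, b;
    for (i = 0; i < count; i++)
        for (b = 0; b < esize; b++)
            out[b * count + i] = in[i * esize + b];
    /* Assumption: bytes past the last whole element are copied as-is. */
    memcpy(out + count * esize, in + count * esize, n - count * esize);
}

/* Inverse: scatter each plane back into its elements. */
static void ref_tpdec(const unsigned char *in, unsigned n,
                      unsigned char *out, unsigned esize) {
    unsigned count = n / esize;
    unsigned i, b;
    for (i = 0; i < count; i++)
        for (b = 0; b < esize; b++)
            out[i * esize + b] = in[b * count + i];
    memcpy(out + count * esize, in + count * esize, n - count * esize);
}

/* Usage: two 4-byte elements that differ only in their first byte. */
static void ref_tp_demo(void) {
    unsigned char in[8] = {0x78, 0x56, 0x34, 0x12, 0x7A, 0x56, 0x34, 0x12};
    unsigned char t[8], back[8];
    ref_tpenc(in, 8, t, 4);     /* t = 78 7A 56 56 34 34 12 12 */
    ref_tpdec(t, 8, back, 4);
    assert(memcmp(in, back, 8) == 0);
}
```

In floating-point data, neighbouring values often share sign and exponent bytes, so after the transpose those bytes form long runs of equal values that lz77-style compressors handle well.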
Nibble transpose:
void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);<br>
void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);<br />
in : input buffer<br />
n : number of bytes<br />
out : output buffer<br />
esize : element size in bytes (2,4,8,...)<br />
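A nibble transpose goes one step further and separates the 4-bit halves of each byte plane. The sketch below shows one possible layout as an assumption: the buffer is byte-transposed first, then all low nibbles are packed two per byte, followed by all high nibbles. The names `tp_index`, `ref_tp4enc` and `ref_tp4dec` are hypothetical, and the library's actual plane order and packing may differ.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helpers, not the library's code. */

/* Source index of byte k of the byte-transposed buffer (tail bytes stay put). */
static unsigned tp_index(unsigned n, unsigned esize, unsigned k) {
    unsigned count = n / esize;             /* number of whole elements */
    if (k >= count * esize) return k;
    return (k % count) * esize + k / count;
}

static void ref_tp4enc(const unsigned char *in, unsigned n,
                       unsigned char *out, unsigned esize) {
    unsigned pairs = n / 2, j;
    for (j = 0; j < pairs; j++) {
        unsigned char a = in[tp_index(n, esize, 2 * j)];
        unsigned char b = in[tp_index(n, esize, 2 * j + 1)];
        out[j]         = (unsigned char)((a & 0x0F) | ((b & 0x0F) << 4)); /* low nibbles  */
        out[pairs + j] = (unsigned char)((a >> 4) | (b & 0xF0));          /* high nibbles */
    }
    if (n & 1) out[n - 1] = in[tp_index(n, esize, n - 1)]; /* odd byte copied whole */
}

static void ref_tp4dec(const unsigned char *in, unsigned n,
                       unsigned char *out, unsigned esize) {
    unsigned pairs = n / 2, j;
    for (j = 0; j < pairs; j++) {
        unsigned char lo = in[j], hi = in[pairs + j];
        out[tp_index(n, esize, 2 * j)]     = (unsigned char)((lo & 0x0F) | ((hi & 0x0F) << 4));
        out[tp_index(n, esize, 2 * j + 1)] = (unsigned char)((lo >> 4) | (hi & 0xF0));
    }
    if (n & 1) out[tp_index(n, esize, n - 1)] = in[n - 1];
}

/* Usage: encode and decode must round-trip for any n and esize >= 1. */
static void ref_tp4_demo(void) {
    unsigned char in[7] = {1, 2, 3, 4, 5, 6, 7}, t[7], back[7];
    ref_tp4enc(in, 7, t, 2);
    ref_tp4dec(t, 7, back, 2);
    assert(memcmp(in, back, 7) == 0);
}
```

Splitting at nibble granularity helps when only part of a byte is stable across elements, for example the upper bits of a mantissa byte, which matches the better Tp4lz ratios on several files in the compression tables.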
Environment:
OS/Compiler (64 bits):
- Linux: GNU GCC (>=4.6)
- Linux: Clang (>=3.2)
- Windows: MinGW-w64 makefile
- Windows: Visual C++ (>=VS2008) - makefile.vs (for nmake)
- Windows: Visual Studio project file - vs/vs2017 - Thanks to PavelP
- Linux ARM: 64 bits aarch64 ARMv8: gcc (>=6.3)
- Linux ARM: 64 bits aarch64 ARMv8: clang
Multithreading:
- All TurboTranspose functions are thread safe
Last update: 25 Oct 2019