About
In theory, single instruction, multiple data (SIMD) vectorization can dramatically accelerate data processing. In brain imaging in particular, we often want to analyze data from millions of voxels. This project explores how much the processing of 32-bit floats benefits from 128-bit SSE (4 voxels per instruction) and 256-bit AVX (8 voxels per instruction) instructions.
There are three ways one could use SIMD methods. First, one could hand-code assembly, which is very tedious. Second, one could leverage higher-level intrinsics. Finally, one could hope a modern compiler is smart enough to generate these instructions automatically.
The primary goal of this project was to teach myself about intrinsics and see if they provide any benefit. The brief take-away is that modern compiler optimization (in particular, -O3 with its ability to vectorize loops) negates any benefit of explicitly coded SIMD. It may be that these very simple operations are simply constrained by memory latency and bandwidth. Indeed, modern computers face a memory wall where CPUs spend most of their time idle while waiting for data.
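For context, the kind of kernel being benchmarked looks like the following scalar loop (an illustrative sketch, not the project's exact code; the function name scale_voxels is made up). With -O3 and -march=native, a modern compiler is free to turn this loop into packed SSE/AVX (and FMA) instructions without any SIMD code in the source:

#include <cstddef>

// Plain scalar loop: apply a slope and intercept to every voxel.
// With g++/clang at -O3 -march=native the compiler may auto-vectorize this,
// emitting packed multiply-add instructions with no intrinsics in the source.
void scale_voxels(float *img, std::size_t n, float slope, float intercept) {
    for (std::size_t i = 0; i < n; i++)
        img[i] = img[i] * slope + intercept;
}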
This project examines a typically sized dataset (173 million voxels) using two operations. The AVX2 era introduced FMA, the fused multiply–add. This computes a multiplication and an addition in a single instruction, whereas earlier Intel instruction sets required two separate instructions. Therefore, one might hope that the ability of AVX2 to compute 8 items in parallel could provide 16 times the performance of traditional methods (and four times the performance of 128-bit SSE). The second operation explored is loading and scaling 16-bit integer data to 32-bit float data. MRI scanners tend to store data with 16 bits of precision, providing a scaling and intercept factor to convert these integers into real numbers. The SSE _mm_cvtepi32_ps instruction can convert four integers to floats at once, again offering the promise of higher performance.
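Sketched with intrinsics, those two kernels might look roughly like this (an illustration under assumptions, not the project's exact code: it requires SSE4.1 and AVX2/FMA support, assumes n is a multiple of the vector width, and the function names are made up):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// out[i] = in[i] * slope + intercept, 8 floats per fused multiply-add (AVX2/FMA).
void fma_avx(const float *in, float *out, std::size_t n, float slope, float intercept) {
    __m256 s = _mm256_set1_ps(slope);
    __m256 b = _mm256_set1_ps(intercept);
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);
        _mm256_storeu_ps(out + i, _mm256_fmadd_ps(v, s, b)); // one instruction replaces mul + add
    }
}

// Convert raw 16-bit integers to scaled 32-bit floats, 4 voxels at a time (SSE4.1).
void i16_to_f32_sse(const int16_t *in, float *out, std::size_t n, float slope, float intercept) {
    __m128 s = _mm_set1_ps(slope);
    __m128 b = _mm_set1_ps(intercept);
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        __m128i v16 = _mm_loadl_epi64((const __m128i *)(in + i));  // load 4 x int16
        __m128i v32 = _mm_cvtepi16_epi32(v16);                     // sign-extend to int32
        __m128  f32 = _mm_cvtepi32_ps(v32);                        // int32 -> float
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(f32, s), b)); // apply slope and intercept
    }
}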
Here the program is compiled without optimization. Note that SSE and AVX dramatically speed up the square-root (sqrt) and fused multiply-add (fma) operations.
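For reference, the AVX square-root kernel reported below can be sketched in a few lines (again illustrative, assuming n is a multiple of 8; the scalar version simply calls sqrtf on each voxel):

#include <immintrin.h>
#include <cstddef>

// Square root of 8 packed floats per AVX instruction.
void sqrt_avx(const float *in, float *out, std::size_t n) {
    for (std::size_t i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i, _mm256_sqrt_ps(_mm256_loadu_ps(in + i)));
}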
>g++ -o tst main.cpp -march=native; ./tst 10 4
Only using 1 thread (not compiled with OpenMP support)
Reporting minimum time for 10 tests
i16_f32: min/mean 561 563 ms
i16_f32sse: min/mean 494 499 ms
sqrt: min/mean 420 424 ms
sqrtSSE: min/mean 172 175 ms
sqrtAVX: min/mean 97 98 ms
fma: min/mean 341 343 ms
fmaSSE: min/mean 253 257 ms
fmaAVX: min/mean 107 111 ms
fma (memory alignment not forced): min/mean 342 346 ms
Next, we optimize the compiler with -O3, and illustrate that the benefits of hand-coded SIMD disappear. Here we also compile with both Clang (which does not support OpenMP on this MacOS computer) and gcc.
>g++ -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.0 (clang-1100.0.33.8)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>g++ -O3 -o tst main.cpp -march=native; ./tst 10 4
Only using 1 thread (not compiled with OpenMP support)
Reporting minimum time for 10 tests
i16_f32: min/mean 70 71 ms
i16_f32sse: min/mean 70 76 ms
sqrt: min/mean 71 72 ms
sqrtSSE: min/mean 73 74 ms
sqrtAVX: min/mean 71 71 ms
fma: min/mean 72 73 ms
fmaSSE: min/mean 72 73 ms
fmaAVX: min/mean 72 72 ms
fma (memory alignment not forced): min/mean 82 87 ms
>g++-9 -v
Using built-in specs.
COLLECT_GCC=g++-9
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/9.2.0_1/libexec/gcc/x86_64-apple-darwin18/9.2.0/lto-wrapper
Target: x86_64-apple-darwin18
Configured with: ../configure --build=x86_64-apple-darwin18 --prefix=/usr/local/Cellar/gcc/9.2.0_1 --libdir=/usr/local/Cellar/gcc/9.2.0_1/lib/gcc/9 --disable-nls --enable-checking=release --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-9 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --with-pkgversion='Homebrew GCC 9.2.0_1' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-multilib --with-native-system-header-dir=/usr/include --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
Thread model: posix
gcc version 9.2.0 (Homebrew GCC 9.2.0_1)
>g++-9 -O3 -fopenmp -o tst main.cpp -march=native; ./tst 10 4
Reporting minimum time for 10 tests
Using 4 threads...
i16_f32: min/mean 69 75 ms
i16_f32sse: min/mean 79 80 ms
sqrt: min/mean 234 238 ms
sqrtSSE: min/mean 72 73 ms
sqrtAVX: min/mean 67 70 ms
fma: min/mean 228 230 ms
fmaSSE: min/mean 72 74 ms
fmaAVX: min/mean 69 69 ms
fma (memory alignment not forced): min/mean 223 225 ms
Using 1 thread...
i16_f32: min/mean 78 78 ms
i16_f32sse: min/mean 79 80 ms
sqrt: min/mean 156 157 ms
sqrtSSE: min/mean 73 75 ms
sqrtAVX: min/mean 70 70 ms
fma: min/mean 72 73 ms
fmaSSE: min/mean 75 75 ms
fmaAVX: min/mean 71 72 ms
fma (memory alignment not forced): min/mean 88 88 ms
Therefore, the takeaway is that modern compilers (at least Clang) allow us to write classic C that is easy to read, maintain, and port to other systems while still offering excellent performance. While more complicated routines may benefit from explicit SIMD, it is not required for simpler operations.
Likewise, OpenMP is an easy way to set up parallel threading. Parallel threads can have dramatic benefits for situations that are not constrained by memory bandwidth. However, most of the operations in the results above are memory constrained.
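For illustration, a single pragma is enough to spread a loop across threads when building with -fopenmp (a sketch with made-up names, not the project's actual code):

#include <cmath>
#include <cstddef>

// One pragma distributes the loop iterations across the available threads.
void sqrt_all(float *img, std::size_t n) {
    #pragma omp parallel for
    for (long long i = 0; i < (long long)n; i++)
        img[i] = std::sqrt(img[i]);
}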
Historically, gcc generated faster code than Clang. However, this is no longer the case: here we see that Clang does a better job of optimizing the sqrt and fma functions. By default, Clang on MacOS does not support OpenMP; hopefully, future releases will provide better support. While OpenMP can be enabled (see https://iscinumpy.gitlab.io/post/omp-on-high-sierra/), the example below shows that the current implementation can be deleterious for these memory-constrained tasks:
>g++ -Xpreprocessor -fopenmp -lomp -O3 -o tst main.cpp -march=native; ./tst 10 4
Reporting minimum time for 10 tests
Using 4 threads...
i16_f32: min/mean 76 78 ms
i16_f32sse: min/mean 76 78 ms
sqrt: min/mean 224 228 ms
sqrtSSE: min/mean 1265 1421 ms
sqrtAVX: min/mean 733 792 ms
fma: min/mean 215 225 ms
fmaSSE: min/mean 1461 1551 ms
fmaAVX: min/mean 1064 1084 ms
fma (memory alignment not forced): min/mean 221 300 ms
Using 1 thread...
i16_f32: min/mean 79 80 ms
i16_f32sse: min/mean 78 80 ms
sqrt: min/mean 70 71 ms
sqrtSSE: min/mean 91 92 ms
sqrtAVX: min/mean 76 78 ms
fma: min/mean 71 71 ms
fmaSSE: min/mean 90 93 ms
fmaAVX: min/mean 74 75 ms
fma (memory alignment not forced): min/mean 85 93 ms
Therefore, for these tests (and in my experience) with the current generation of compilers (early 2020), Clang does a better job of optimizing single-threaded code, but gcc handles multi-threaded code better.
This program can also be compiled for ARM-based CPUs, such as the Apple M1, by targeting the 128-bit Neon instructions. The sse2neon translations are used for the intrinsic functions. To compile for ARM, run make ARCH=arm. Future ARM CPUs are expected to support Scalable Vector Extension (SVE) instructions, providing a further point of comparison with SSE.
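A sketch of how such a translation header is typically wired in (the preprocessor guard shown here is illustrative; the project's actual build logic may differ):

// On x86 CPUs, use the native SSE/AVX intrinsics; on ARM, sse2neon maps the
// same _mm_* intrinsics onto 128-bit Neon instructions.
#if defined(__x86_64__) || defined(__i386__)
  #include <immintrin.h>
#else
  #include "sse2neon.h" // https://github.com/DLTcollab/sse2neon
#endif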
Links
- FastMath is an elegant vectorized library for Delphi that accelerates code on x86_64 and ARM CPUs.
- The Simd Library for C and C++ accelerates code on x86_64, PowerPC, and ARM CPUs.