Awesome

simd_utils

A header only library implementing common mathematical functions using SIMD intrinsics. This library is C/C++ compatible (tested with GCC7.5/9.3/10.2/11.3/12.0, clang 9 and icc 2021).

Thanks to Julien Pommier and Giovanni Garberoglio for their work on sin,cos,log, and exp functions in SSE, AVX, and NEON intrinsics. Thanks to the DLTcollab team for their work on sse2neon.

What is SIMD Utils?

The purpose of this library is to give an open-source implementation of SIMD optimized commonly used algorithms, such as type conversion (float32, float64, uint16, ...), trigonometry (sin, cos, atan, ...), log/exp, min/max, and other functions. Its API was thought as a simple replacement for Intel IPP/MKL libraries. Some of the functions are vectorised version of the cephes maths library (https://www.netlib.org/cephes/)

Why use SIMD Utils?

It's free
It's open source
It works on a wide range of machines, including Arm 32bits (with NEON) and 64bits

Targets

Supported targets are :

SSE (SSE4.X mostly)
AVX (AVX and AVX2)
AVX512
ARM Neon (through sse2neon plus some optimized functions).
RISC-V Vector extension 1.0
PowerPC Alitivec (no double precision suppport)

128 bit functions (SSE, NEON, ALTIVEC) are name function128type, such as asin128f, which computes the arcsinus function on an float32 array. Float64 functions have the "d" suffix. 256 bit functions (AVX/AVX2) have 256 instead of 128 in their name, such as asin256f. 256 bit functions (AVX512) have 512 instead of 128 in their name, such as cos512f. Vector functions (RISCV) for which the SIMD length makes less sense, are name functionType_vec, such as subs_vec, which substract an int32 array from and other one.

The project has been tested on :

Intel Atom
Intel Ivy Bridge Core-i7
Intel Skylake Core-i7
Intel Cannonlake Core-i7
Intel SDE (emulator) for AVX-512
Qemu 5.X (emulator) )for arm/aarch64, ppc and riscv
Cortex-a53 (Raspberry Pi 3B)
Cortex-a9 (ZYBO)
PowerPC G5 (iMac G5)
RISCV Ox64 (C906 core)

Building

To build the project you will need the sse_mathfun.h, avx_mathfun.h and neon_mathun.h headers available here http://gruntthepeon.free.fr/ssemath/, and there http://software-lisc.fbk.eu/avx_mathfun/ This project also uses a forked version of sse2neon (https://github.com/DLTcollab/sse2neon) adding functions such as double precision and Fused Multiple Add.

Simply include simd_utils.h in your C/C++ file, and compile with :

SSE support : gcc -DSSE -msse4.2 -c file.c -I .
AVX support : gcc -DSSE -DAVX -mavx2 -c file.c -I .
AVX512 support : gcc -DSSE -DAVX -DAVX512 -march=skylake-avx512 -mprefer-vector-width=512 -c file.c -I .
ARM V7 NEON support : arm-none-linux-gnueabihf-gcc -march=armv7-a -mfpu=neon -DARM -DSSE -flax-vector-conversions -c file.c -I .
ARM V8 NEON support : aarch64-linux-gnu-gcc -DARM -DFMA -DSSE -flax-vector-conversions -c file.c -I .
RISCV support : riscv64-unknown-linux-gnu-gcc -DRISCV -march=rv64gcv -c file.c-I .
ALTIVEC support : powerpc64-linux-gnu-gcc -DALTIVEC -DFMA -maltivec -flax-vector-conversions -c file.c -I .

For FMA support you need to add -DFMA and -mfma to x86 targets, and -DFMA to Armv8 targets. For ARMV7 targets, you could also add -DSSE2NEON_PRECISE_SQRT for improved accuracy with sqrt and rsqrt For X86 targets with ICC compiler, simply add -DICC to activate Intel SVML intrinsics. Altivec support is intended mostly for older Big Endian PowerPC. Newer Little Endian might benefit from a direct conversion from SSE similar to sse2neon.

OpenCL (experimental)

The same approach is applied to OpenCL kernels as an experiment, focused on GPUs, but other OpenCL devices may work. At the moment only some functions are supported (log, exp, sincos, tan, atan, atan2, asin, sqrt), based on the cephes library, which seems to be faster that the OpenCL native functions (tested on Intel GPU with beignet 1.3) To try it out, simply use :

gcc -DSSE -msse4.2 -march=native simd_test_opencl.c -lOpenCL -lrt -lm (add -DSIMPLE_BUFFERS for CPU devices)

Supported Functions

SSE/NEON are 128bits wide. SSE functions use up to SSE4.2 features. Some functions are directly coded using NEON intrinsics (for performance reasons), but most functions translate SSE code to NEON using sse2neon header. Some AVX functions, such as integer ones, require AVX2. The 256 bit integer functions are emulated using SSE for some floating point functions if AVX2 is unavailable. Altivec implemented functions are indicated with "(a)".

The following table is a work in progress, "?" means there is not yet an implemented function (or a directly equivalent Intel IPP function) :

SSE/NEON/ALTIVEC (X=128), AVX (X=256), AVX512 (X=512)	C_REF	IPP_REF	RISCV

log10Xf/precise (a)	log10f_C	ippsLog10_32f_A24	log10f_vec
log2Xf/precise (a)	log2f_C		log2f_vec
lnXf (a)	lnf_C	ippsLn_32f_A24	lnf_vec
lnXd	ln_C	ippsLn_64f_A53	?
expXf (a)	expf_C	ippsExp_32f_A24	expf_vec
cbrtXf (a)	cbrtf_C	?	cbrtf_vec
fabsXf (a)	fabsf_C	ippsAbs_32f	fabsf_vec
setXf (a)	setf_C	ippsSet_32f	setf_vec
zeroXf (a)	zerof_C	ippsZero_32f	zerof_vec
copyXf (a)	copyf_C	ippsCopy_32f	copyf_vec
addXf (a)	addf_c	ippsAdd_32f	addf_vec
mulXf (a)	mulf_C	ippsMul_32f	mulf_vec
subXf (a)	subf_c	ippsSub_32f	subf_vec
addcXf (a)	addcf_C	ippsAddC_32f	addcf_vec
mulcXf (a)	mulcf_C	ippsMulC_32f	mulcf_vec
muladdXf	muladdf_C	?	muladdf_vec
mulcaddXf	mulcaddf_C	?	mulcaddf_vec
mulcaddcXf	mulcaddcf_C	?	mulcaddcf_vec
muladdcXf	muladdcf_C	?	muladdcf_vec
divXf (a)	divf_C	ippsDiv_32f_A24	divf_vec
dotXf (a)	dotf_C	ippsDotProd_32f	dotf_vec
dotcXf (a)	dotcf_C	ippsDotProd_32fc	dotcf_vec
vectorSlopeXf (a)	vectorSlopef_C	ippsVectorSlope_32f	vectorSlopef_vec
convertFloat32ToU8_X (a)	convertFloat32ToU8_C	ippsConvert_32f8u_Sfs	convertFloat32ToU8_vec
convertFloat32ToU16_X (a)	convertFloat32ToI16_C	ippsConvert_32f16u_Sfs	convertFloat32ToU16_vec
convertFloat32ToI16_X (a)	convertFloat32ToI16_C	ippsConvert_32f16s_Sfs	convertFloat32ToI16_vec
convertInt16ToFloat32_X (a)	convertInt16ToFloat32_C	ippsConvert_16s32f_Sfs	convertInt16ToFloat32_vec
cplxtorealXf (a)	cplxtorealf_C	ippsCplxToReal_32fc	cplxtorealf_vec
realtocplxXf (a)	realtocplx_C	ippsRealToCplx_32f	realtocplxf_vec
convertX_64f32f	convert_64f32f_C	ippsConvert_64f32f	convert_64f32f_vec
convertX_32f64f	convert_32f64f_C	ippsConvert_32f64f	convert_32f64f_vec
flipXf (a)	flipf_C	ippsFlip_32f	flipf_vec
maxeveryXf (a)	maxeveryf_c	ippsMaxEvery_32f	maxeveryf_vec
mineveryXf (a)	mineveryf_c	ippsMinEvery_32f	mineveryf_vec
minmaxXf (a)	minmaxf_c	ippsMinMax_32f	minmaxf_vec
thresholdX_gt_f (a)	threshold_gt_f_C	ippsThreshold_GT_32f	threshold_gt_f_vec
thresholdX_gtabs_f (a)	threshold_gtabs_f_C	ippsThreshold_GTAbs_32f	threshold_gtabs_f_vec
thresholdX_lt_f (a)	threshold_lt_f_C	ippsThreshold_LT_32f	threshold_lt_f_vec
thresholdX_ltabs_f (a)	threshold_ltabs_f_C	ippsThreshold_LTAbs_32f	threshold_ltabs_f_vec
thresholdX_ltval_gtval_f (a)	threshold_ltval_gtval_f_C	ippsThreshold_LTValGTVal_32f	threshold_ltval_gtval_f_vec
sinXf	sinf_C	ippsSin_32f_A24	sinf_vec
cosXf	cosf_C	ippsCos_32f_A24	cosf_vec
sincosXf (a)	sincosf_C	ippsSinCos_32f_A24	sincosf_vec
sincosXf_interleaved (a)	sincosf_C_interleaved	ippsCIS_32fc_A24	sincosf_interleaved_vec
coshXf (a)	coshf_C	ippsCosh_32f_A24	coshf_vec
sinhXf (a)	sinhf_C	ippsSinh_32f_A24	sinhf_vec
acoshXf (a)	acoshf_C	ippsAcosh_32f_A24	acoshf_vec
asinhXf (a)	asinhf_C	ippsAsinh_32f_A24	asinhf_vec
atanhXf (a)	atanhf_C	ippsAtanh_32f_A24	atanh_vec
atanXf (a)	atanf_C	ippsAtan_32f_A24	atanf_vec
atan2Xf (a)	atan2f_C	ippsAtan2_32f_A24	atan2f_vec
atan2Xf_interleaved (a)	atan2f_interleaved_C	?	atan2f_interleaved_vec
asinXf (a)	asinf_C	ippsAsin_32f_A24	asinf_vec
tanhXf (a)	tanhf_C	ippsTanh_32f_A24	tanhf_vec
tanXf (a)	tanf_C	ippsTan_32f_A24	tanf_vec
tanXd (a)	tan_C	ippsTan_64f_A53	?
magnitudeXf_split (a)	magnitudef_C_split	ippsMagnitude_32f	magnitudef_split_vec
powerspectXf_split (a)	powerspectf_C_split	ippsPowerSpectr_32f	powerspectf_split_vec
magnitudeXf_interleaved	magnitudef_C_interleaved	ippsMagnitude_32fc	magnitudef_interleaved_vec
powerspectXf_interleaved	powerspectf_C_interleaved	ippsPowerSpectr_32fc	powerspectf_interleaved_vec
subcrevXf (a)	subcrevf_C	ippsSubCRev_32f	subcrevf_vec
sumXf (a)	sumf_C	ippsSum_32f	sumf_vec
meanXf (a)	meanf_C	ippsMean_32f	meanf_vec
sqrtXf (a)	sqrtf_C	ippsSqrt_32f	sqrtf_vec
roundXf (a)	roundf_C	ippsRound_32f	roundf_vec
rintXf (a)	rintf_C	?	rintf_vec
ceilXf (a)	ceilf_C	ippsCeil_32f	ceilf_vec
floorXf (a)	floorf_C	ippsFloor_32f	floorf_vec
truncXf (a)	truncf_C	ippsTrunc_32f	truncf_vec
modfXf (a)	modff_C	ippsModf_32f	modf_vec
cplxvecmulXf (a)	cplxvecmul_C/precise	ippsMul_32fc_A11/24	cplxvecmulf_vec
cplxvecmulXf_split (a)	cplxvecmul_C_split/precise	?	cplxvecmulf_vec_split
cplxconjvecmulXf (a)	cplxconjvecmul_C	ippsMulByConj_32fc_A24	cplxconjvecmulf_vec
cplxconjvecmulXf_split	cplxconjvecmul_C_split	?	cplxconjvecmulf_vec_split
cplxconjXf (a)	cplxconj_C	ippsConj_32fc_A24	cplxconjf_vec
cplxvecdivXf (a)	cplxvecdiv_C	?	cplxvecdivf_vec
cplxvecdivXf_split (a)	cplxvecdiv_C_split	?	cplxvecdivf_vec_split
setXd	setd_C	ippsSet_64f	setd_vec
zeroXd	zerod_C	ippsZero_64f	zerod_vec
copyXd	copyd_C	ippsCopy_64f	copyd_vec
sqrtXd	sqrtd_C	ippsSqrt_64f	sqrtd_vec
addXd	addd_c	ippsAdd_64f	addd_vec
mulXd	muld_c	ippsMul_64f	muld_vec
subXd	subd_c	ippsSub_64f	subd_vec
divXd	divd_c	ippsDiv_64f	divd_vec
addcXd	addcd_C	ippsAddC_64f	addcd_vec
mulcXd	mulcd_C	ippsMulC_64f	mulcd_vec
muladdXd	muladdd_C	?	muladdd_vec
mulcaddXd	mulcaddd_C	?	muladdcd_vec
mulcaddcXd	mulcaddcd_C	?	mulcaddcd_vec
muladdcXd	muladdcd_C	?	muladdcd_vec
roundXd	roundd_C	ippsRound_64f	roundd_vec
rintXd	rintd_C	?	rintd_vec
ceilXd	ceild_C	ippsCeil_64f	ceild_vec
floorXd	floord_C	ippsFloor_64f	floord_vec
truncXd	truncd_C	ippsTrunc_64f	truncd_vec
vectorSlopeXd	vectorSloped_C	ippsVectorSlope_64f	vectorSloped_vec
sincosXd	sincosd_C	ippsSinCos_64f_A53	sincosd_vec
sincosXd_interleaved	sincosd_C_interleaved	ippsCIS_64fc_A53	sincosd_interleaved_vec
atanXd	atan_C	ippsAtan_64f_A53	atand_vec
atan2Xd	atan2d_C	ippsAtan2_64f_A53	atan2d_vec
atan2Xd_interleaved	atan2_interleaved_C	?	atan2d_interleaved_vec
asinXd	asin_C	ippsAsin_64f_A53	asind_vec
cplxtorealXd	cplxtoreald_C	ippsCplxToReal_64fc	cplxtoreald_vec
realtocplxXd	realtocplxd_C	ippsRealToCplx_64f	realtocplxd_vec
expXd	exp_C	ippsExp_64f_A53	expd_vec
addXs (a)	adds_c	?	adds_vec
mulXs	muls_c	?	muls_vec
subXs (a)	subs_c	?	subs_vec
addcXs (a)	addcs_C	?	addcs_vec
vectorSlopeXs (a)	vectorSlopes_C	ippsVectorSlope_32s	vectorSlopes_vec
flipXs (a)	flips_C	?	flips_vec
maxeveryXs (a)	maxeverys_c	?	maxeverys_vec
mineveryXs (a)	mineverys_c	?	mineverys_vec
minmaxXs (a)	minmaxs_c	ippsMinMax_32s	minmaxs_vec
thresholdX_gt_s (a)	threshold_gt_s_C	ippsThreshold_GT_32s	thresholdX_gt_s_vec
thresholdX_gtabs_s (a)	threshold_gtabs_s_C	ippsThreshold_GTAbs_32s	thresholdX_gtabs_s_vec
thresholdX_lt_s (a)	threshold_lt_s_C	ippsThreshold_LT_32s	thresholdX_lt_s_vec
thresholdX_ltabs_s (a)	threshold_ltabs_s_C	ippsThreshold_LTAbs_32s	thresholdX_ltabs_s_vec
thresholdX_ltval_gtval_s (a)	threshold_ltval_gtval_s_C	ippsThreshold_LTValGTVal_32s	threshold_ltval_gtval_s_vec
copyXs (a)	copys_C	ippsCopy_32s	copys_vec
?	?	?	mulcs_vec
absdiff16s_Xs (a)	absdiff16s_c	?	absdiff16s_vec
sum16s32sX (a)	sum16s32s_C	ippsSum_16s32s_Sfs	sum16s32s_vec
?	ors_c	ippsOr_32u	?
?	ands_c	ippsAnd_32u	?
sigmoidXf (a)	sigmoidf_C	?	sigmoidf_vec
PReluXf (a)	PReluf_C	?	PReluf_vec
softmaxXf (a)	softmaxf_C	?	softmaxf_vec
pol2cart2DXf (a)	pol2cart2Df_C	?	pol2cart2Df_vec
cart2pol2DXf (a)	cart2pol2Df_C	?	cart2pol2Df_vec
gatheri_256/512s	gatheri_C	?	?
fp32tofp16128/256	fp32tofp16_C	?	?
fp16tofp32128/256	fp16tofp32_C	?	?
?	floodFill_4C_8u	ippiFloodFill_4Con_8u_C1IR	?
?	floodFill_4C_32s	ippiFloodFill_4Con_32s_C1IR	?
?	floodFill_8C_8u	ippiFloodFill_8Con_8u_C1IR	?
?	floodFill_8C_32s	ippiFloodFill_8Con_32s_C1IR	?
?	floodFill_8C_32f	ippiFloodFill_8Con_32f_C1IR	?

Licence

This library is released under BSD licence so that everyone can freely use it in their project, find bugs, propose new functions or enhance existing ones.