simd_utils

A header-only library implementing common mathematical functions using SIMD intrinsics. This library is C/C++ compatible (tested with GCC 7.5/9.3/10.2/11.3/12.0, Clang 9 and ICC 2021).

Thanks to Julien Pommier and Giovanni Garberoglio for their work on the sin, cos, log, and exp functions in SSE, AVX, and NEON intrinsics. Thanks to the DLTcollab team for their work on sse2neon.

What is SIMD Utils?

The purpose of this library is to provide an open-source implementation of commonly used SIMD-optimized algorithms, such as type conversion (float32, float64, uint16, ...), trigonometry (sin, cos, atan, ...), log/exp, min/max, and other functions. Its API was designed as a simple replacement for the Intel IPP/MKL libraries. Some of the functions are vectorised versions of functions from the cephes maths library (https://www.netlib.org/cephes/).

Why use SIMD Utils?

Targets

Supported targets are :

128 bit functions (SSE, NEON, ALTIVEC) are named function128type, such as asin128f, which computes the arcsine of a float32 array. Float64 functions have the "d" suffix. 256 bit functions (AVX/AVX2) have 256 instead of 128 in their name, such as asin256f. 512 bit functions (AVX512) have 512 instead of 128 in their name, such as cos512f. Vector functions (RISCV), for which the SIMD length makes less sense, are named functionType_vec, such as subs_vec, which subtracts one int32 array from another.
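The naming scheme above makes it possible to pick a variant at compile time by pasting the width into the function name. The following is a minimal sketch, not part of the library: the SIMD_* macros are hypothetical helpers, and only the basic function128type pattern is covered (names that place the width elsewhere would need their own macro).

```c
#include <assert.h>

/* Pick the widest SIMD width the compiler was asked to target.
   __AVX512F__ and __AVX__ are predefined by GCC and Clang when the
   matching -m flags are given; 128 covers SSE, NEON and Altivec. */
#if defined(__AVX512F__)
#define SIMD_WIDTH 512
#elif defined(__AVX__)
#define SIMD_WIDTH 256
#else
#define SIMD_WIDTH 128
#endif

static_assert(SIMD_WIDTH == 128 || SIMD_WIDTH == 256 || SIMD_WIDTH == 512,
              "unexpected SIMD width");

/* Two macro levels so that SIMD_WIDTH is expanded before token pasting. */
#define SIMD_PASTE3(a, b, c) a##b##c
#define SIMD_PASTE3_X(a, b, c) SIMD_PASTE3(a, b, c)

/* SIMD_FN(asin, f) expands to asin128f, asin256f or asin512f. */
#define SIMD_FN(base, type) SIMD_PASTE3_X(base, SIMD_WIDTH, type)

/* Same name as a string literal, handy for logging which variant was chosen. */
#define SIMD_STR(x) #x
#define SIMD_STR_X(x) SIMD_STR(x)
#define SIMD_FN_NAME(base, type) SIMD_STR_X(SIMD_FN(base, type))
```

With simd_utils.h included, SIMD_FN(asin, f) can then be used wherever the explicit name would be written, with whatever arguments the selected function expects.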

The project has been tested on :

Building

To build the project you will need the sse_mathfun.h, avx_mathfun.h and neon_mathfun.h headers, available at http://gruntthepeon.free.fr/ssemath/ and http://software-lisc.fbk.eu/avx_mathfun/. This project also uses a forked version of sse2neon (https://github.com/DLTcollab/sse2neon) that adds functions such as double precision and Fused Multiply-Add.

Simply include simd_utils.h in your C/C++ file, and compile with :

For FMA support you need to add -DFMA and -mfma to x86 targets, and -DFMA to Armv8 targets. For Armv7 targets, you can also add -DSSE2NEON_PRECISE_SQRT for improved accuracy with sqrt and rsqrt. For x86 targets with the ICC compiler, simply add -DICC to activate Intel SVML intrinsics. Altivec support is intended mostly for older big-endian PowerPC. Newer little-endian PowerPC targets might benefit from a direct conversion from SSE, similar to sse2neon.
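To illustrate what a define such as -DFMA typically controls, here is a minimal scalar sketch of a compile-time switch between a fused and an unfused multiply-add. The function muladd_sketch is hypothetical and is not the library's implementation; only the FMA macro name is taken from the text above.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical scalar helper, not part of simd_utils:
   dst[i] = a[i] * b[i] + c[i] for i in [0, len). */
static void muladd_sketch(const float *a, const float *b, const float *c,
                          float *dst, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++) {
#ifdef FMA
        /* Fused path: one rounding step for the whole expression. */
        dst[i] = fmaf(a[i], b[i], c[i]);
#else
        /* Unfused path: the product is rounded, then the sum is rounded. */
        dst[i] = a[i] * b[i] + c[i];
#endif
    }
}
```

Both paths give the same result whenever the product is exactly representable. They can differ in the last bit otherwise, which is why the fused path is opt-in. When building this sketch with -DFMA, the math library may also have to be linked (for example with -lm on GCC).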

OpenCL (experimental)

The same approach is applied to OpenCL kernels as an experiment, focused on GPUs, but other OpenCL devices may work. At the moment only some functions are supported (log, exp, sincos, tan, atan, atan2, asin, sqrt). They are based on the cephes library and seem to be faster than the OpenCL native functions (tested on an Intel GPU with beignet 1.3). To try it out, simply use :

Supported Functions

SSE and NEON are 128 bits wide. SSE functions use up to SSE4.2 features. Some functions are coded directly with NEON intrinsics (for performance reasons), but most functions translate SSE code to NEON using the sse2neon header. Some AVX functions, such as integer ones, require AVX2. If AVX2 is unavailable, the 256 bit integer operations needed by some floating point functions are emulated using SSE. Functions implemented for Altivec are indicated with "(a)".
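The general idea behind emulating a 256 bit integer operation with 128 bit ones can be sketched as follows. This is only an illustration under stated assumptions: the types v4si_sketch and v8si_sketch are hypothetical, and plain structs stand in for SSE registers so that the example stays portable. The library's own emulation may be organised differently.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for one 128-bit register holding four int32 lanes
   (hypothetical type, used instead of __m128i to stay portable). */
typedef struct { int32_t lane[4]; } v4si_sketch;

/* Emulated 256-bit integer register: two 128-bit halves. */
typedef struct { v4si_sketch lo, hi; } v8si_sketch;

static_assert(sizeof(v8si_sketch) == 2 * sizeof(v4si_sketch),
              "the 256-bit value is exactly two 128-bit halves");

/* 128-bit lane-wise add: the operation assumed to exist natively.
   The unsigned arithmetic wraps on overflow, like the SIMD instruction. */
static v4si_sketch add_v4si(v4si_sketch a, v4si_sketch b)
{
    v4si_sketch r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (int32_t)((uint32_t)a.lane[i] + (uint32_t)b.lane[i]);
    return r;
}

/* 256-bit add built from two independent 128-bit adds. */
static v8si_sketch add_v8si(v8si_sketch a, v8si_sketch b)
{
    v8si_sketch r;
    r.lo = add_v4si(a.lo, b.lo);
    r.hi = add_v4si(a.hi, b.hi);
    return r;
}
```

Lane-wise operations split cleanly this way because no lane depends on another. Operations that move data across the 128 bit boundary need extra handling.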

The following table is a work in progress. "?" means that no function is implemented yet (or that there is no directly equivalent Intel IPP function):

| SSE/NEON/ALTIVEC (X=128), AVX (X=256), AVX512 (X=512) | C_REF | IPP_REF | RISCV |
|---|---|---|---|
| log10Xf/precise (a) | log10f_C | ippsLog10_32f_A24 | log10f_vec |
| log2Xf/precise (a) | log2f_C | | log2f_vec |
| lnXf (a) | lnf_C | ippsLn_32f_A24 | lnf_vec |
| lnXd | ln_C | ippsLn_64f_A53 | ? |
| expXf (a) | expf_C | ippsExp_32f_A24 | expf_vec |
| cbrtXf (a) | cbrtf_C | ? | cbrtf_vec |
| fabsXf (a) | fabsf_C | ippsAbs_32f | fabsf_vec |
| setXf (a) | setf_C | ippsSet_32f | setf_vec |
| zeroXf (a) | zerof_C | ippsZero_32f | zerof_vec |
| copyXf (a) | copyf_C | ippsCopy_32f | copyf_vec |
| addXf (a) | addf_c | ippsAdd_32f | addf_vec |
| mulXf (a) | mulf_C | ippsMul_32f | mulf_vec |
| subXf (a) | subf_c | ippsSub_32f | subf_vec |
| addcXf (a) | addcf_C | ippsAddC_32f | addcf_vec |
| mulcXf (a) | mulcf_C | ippsMulC_32f | mulcf_vec |
| muladdXf | muladdf_C | ? | muladdf_vec |
| mulcaddXf | mulcaddf_C | ? | mulcaddf_vec |
| mulcaddcXf | mulcaddcf_C | ? | mulcaddcf_vec |
| muladdcXf | muladdcf_C | ? | muladdcf_vec |
| divXf (a) | divf_C | ippsDiv_32f_A24 | divf_vec |
| dotXf (a) | dotf_C | ippsDotProd_32f | dotf_vec |
| dotcXf (a) | dotcf_C | ippsDotProd_32fc | dotcf_vec |
| vectorSlopeXf (a) | vectorSlopef_C | ippsVectorSlope_32f | vectorSlopef_vec |
| convertFloat32ToU8_X (a) | convertFloat32ToU8_C | ippsConvert_32f8u_Sfs | convertFloat32ToU8_vec |
| convertFloat32ToU16_X (a) | convertFloat32ToI16_C | ippsConvert_32f16u_Sfs | convertFloat32ToU16_vec |
| convertFloat32ToI16_X (a) | convertFloat32ToI16_C | ippsConvert_32f16s_Sfs | convertFloat32ToI16_vec |
| convertInt16ToFloat32_X (a) | convertInt16ToFloat32_C | ippsConvert_16s32f_Sfs | convertInt16ToFloat32_vec |
| cplxtorealXf (a) | cplxtorealf_C | ippsCplxToReal_32fc | cplxtorealf_vec |
| realtocplxXf (a) | realtocplx_C | ippsRealToCplx_32f | realtocplxf_vec |
| convertX_64f32f | convert_64f32f_C | ippsConvert_64f32f | convert_64f32f_vec |
| convertX_32f64f | convert_32f64f_C | ippsConvert_32f64f | convert_32f64f_vec |
| flipXf (a) | flipf_C | ippsFlip_32f | flipf_vec |
| maxeveryXf (a) | maxeveryf_c | ippsMaxEvery_32f | maxeveryf_vec |
| mineveryXf (a) | mineveryf_c | ippsMinEvery_32f | mineveryf_vec |
| minmaxXf (a) | minmaxf_c | ippsMinMax_32f | minmaxf_vec |
| thresholdX_gt_f (a) | threshold_gt_f_C | ippsThreshold_GT_32f | threshold_gt_f_vec |
| thresholdX_gtabs_f (a) | threshold_gtabs_f_C | ippsThreshold_GTAbs_32f | threshold_gtabs_f_vec |
| thresholdX_lt_f (a) | threshold_lt_f_C | ippsThreshold_LT_32f | threshold_lt_f_vec |
| thresholdX_ltabs_f (a) | threshold_ltabs_f_C | ippsThreshold_LTAbs_32f | threshold_ltabs_f_vec |
| thresholdX_ltval_gtval_f (a) | threshold_ltval_gtval_f_C | ippsThreshold_LTValGTVal_32f | threshold_ltval_gtval_f_vec |
| sinXf | sinf_C | ippsSin_32f_A24 | sinf_vec |
| cosXf | cosf_C | ippsCos_32f_A24 | cosf_vec |
| sincosXf (a) | sincosf_C | ippsSinCos_32f_A24 | sincosf_vec |
| sincosXf_interleaved (a) | sincosf_C_interleaved | ippsCIS_32fc_A24 | sincosf_interleaved_vec |
| coshXf (a) | coshf_C | ippsCosh_32f_A24 | coshf_vec |
| sinhXf (a) | sinhf_C | ippsSinh_32f_A24 | sinhf_vec |
| acoshXf (a) | acoshf_C | ippsAcosh_32f_A24 | acoshf_vec |
| asinhXf (a) | asinhf_C | ippsAsinh_32f_A24 | asinhf_vec |
| atanhXf (a) | atanhf_C | ippsAtanh_32f_A24 | atanh_vec |
| atanXf (a) | atanf_C | ippsAtan_32f_A24 | atanf_vec |
| atan2Xf (a) | atan2f_C | ippsAtan2_32f_A24 | atan2f_vec |
| atan2Xf_interleaved (a) | atan2f_interleaved_C | ? | atan2f_interleaved_vec |
| asinXf (a) | asinf_C | ippsAsin_32f_A24 | asinf_vec |
| tanhXf (a) | tanhf_C | ippsTanh_32f_A24 | tanhf_vec |
| tanXf (a) | tanf_C | ippsTan_32f_A24 | tanf_vec |
| tanXd (a) | tan_C | ippsTan_64f_A53 | ? |
| magnitudeXf_split (a) | magnitudef_C_split | ippsMagnitude_32f | magnitudef_split_vec |
| powerspectXf_split (a) | powerspectf_C_split | ippsPowerSpectr_32f | powerspectf_split_vec |
| magnitudeXf_interleaved | magnitudef_C_interleaved | ippsMagnitude_32fc | magnitudef_interleaved_vec |
| powerspectXf_interleaved | powerspectf_C_interleaved | ippsPowerSpectr_32fc | powerspectf_interleaved_vec |
| subcrevXf (a) | subcrevf_C | ippsSubCRev_32f | subcrevf_vec |
| sumXf (a) | sumf_C | ippsSum_32f | sumf_vec |
| meanXf (a) | meanf_C | ippsMean_32f | meanf_vec |
| sqrtXf (a) | sqrtf_C | ippsSqrt_32f | sqrtf_vec |
| roundXf (a) | roundf_C | ippsRound_32f | roundf_vec |
| rintXf (a) | rintf_C | ? | rintf_vec |
| ceilXf (a) | ceilf_C | ippsCeil_32f | ceilf_vec |
| floorXf (a) | floorf_C | ippsFloor_32f | floorf_vec |
| truncXf (a) | truncf_C | ippsTrunc_32f | truncf_vec |
| modfXf (a) | modff_C | ippsModf_32f | modf_vec |
| cplxvecmulXf (a) | cplxvecmul_C/precise | ippsMul_32fc_A11/24 | cplxvecmulf_vec |
| cplxvecmulXf_split (a) | cplxvecmul_C_split/precise | ? | cplxvecmulf_vec_split |
| cplxconjvecmulXf (a) | cplxconjvecmul_C | ippsMulByConj_32fc_A24 | cplxconjvecmulf_vec |
| cplxconjvecmulXf_split | cplxconjvecmul_C_split | ? | cplxconjvecmulf_vec_split |
| cplxconjXf (a) | cplxconj_C | ippsConj_32fc_A24 | cplxconjf_vec |
| cplxvecdivXf (a) | cplxvecdiv_C | ? | cplxvecdivf_vec |
| cplxvecdivXf_split (a) | cplxvecdiv_C_split | ? | cplxvecdivf_vec_split |
| setXd | setd_C | ippsSet_64f | setd_vec |
| zeroXd | zerod_C | ippsZero_64f | zerod_vec |
| copyXd | copyd_C | ippsCopy_64f | copyd_vec |
| sqrtXd | sqrtd_C | ippsSqrt_64f | sqrtd_vec |
| addXd | addd_c | ippsAdd_64f | addd_vec |
| mulXd | muld_c | ippsMul_64f | muld_vec |
| subXd | subd_c | ippsSub_64f | subd_vec |
| divXd | divd_c | ippsDiv_64f | divd_vec |
| addcXd | addcd_C | ippsAddC_64f | addcd_vec |
| mulcXd | mulcd_C | ippsMulC_64f | mulcd_vec |
| muladdXd | muladdd_C | ? | muladdd_vec |
| mulcaddXd | mulcaddd_C | ? | muladdcd_vec |
| mulcaddcXd | mulcaddcd_C | ? | mulcaddcd_vec |
| muladdcXd | muladdcd_C | ? | muladdcd_vec |
| roundXd | roundd_C | ippsRound_64f | roundd_vec |
| rintXd | rintd_C | ? | rintd_vec |
| ceilXd | ceild_C | ippsCeil_64f | ceild_vec |
| floorXd | floord_C | ippsFloor_64f | floord_vec |
| truncXd | truncd_C | ippsTrunc_64f | truncd_vec |
| vectorSlopeXd | vectorSloped_C | ippsVectorSlope_64f | vectorSloped_vec |
| sincosXd | sincosd_C | ippsSinCos_64f_A53 | sincosd_vec |
| sincosXd_interleaved | sincosd_C_interleaved | ippsCIS_64fc_A53 | sincosd_interleaved_vec |
| atanXd | atan_C | ippsAtan_64f_A53 | atand_vec |
| atan2Xd | atan2d_C | ippsAtan2_64f_A53 | atan2d_vec |
| atan2Xd_interleaved | atan2_interleaved_C | ? | atan2d_interleaved_vec |
| asinXd | asin_C | ippsAsin_64f_A53 | asind_vec |
| cplxtorealXd | cplxtoreald_C | ippsCplxToReal_64fc | cplxtoreald_vec |
| realtocplxXd | realtocplxd_C | ippsRealToCplx_64f | realtocplxd_vec |
| expXd | exp_C | ippsExp_64f_A53 | expd_vec |
| addXs (a) | adds_c | ? | adds_vec |
| mulXs | muls_c | ? | muls_vec |
| subXs (a) | subs_c | ? | subs_vec |
| addcXs (a) | addcs_C | ? | addcs_vec |
| vectorSlopeXs (a) | vectorSlopes_C | ippsVectorSlope_32s | vectorSlopes_vec |
| flipXs (a) | flips_C | ? | flips_vec |
| maxeveryXs (a) | maxeverys_c | ? | maxeverys_vec |
| mineveryXs (a) | mineverys_c | ? | mineverys_vec |
| minmaxXs (a) | minmaxs_c | ippsMinMax_32s | minmaxs_vec |
| thresholdX_gt_s (a) | threshold_gt_s_C | ippsThreshold_GT_32s | thresholdX_gt_s_vec |
| thresholdX_gtabs_s (a) | threshold_gtabs_s_C | ippsThreshold_GTAbs_32s | thresholdX_gtabs_s_vec |
| thresholdX_lt_s (a) | threshold_lt_s_C | ippsThreshold_LT_32s | thresholdX_lt_s_vec |
| thresholdX_ltabs_s (a) | threshold_ltabs_s_C | ippsThreshold_LTAbs_32s | thresholdX_ltabs_s_vec |
| thresholdX_ltval_gtval_s (a) | threshold_ltval_gtval_s_C | ippsThreshold_LTValGTVal_32s | threshold_ltval_gtval_s_vec |
| copyXs (a) | copys_C | ippsCopy_32s | copys_vec |
| ? | ? | ? | mulcs_vec |
| absdiff16s_Xs (a) | absdiff16s_c | ? | absdiff16s_vec |
| sum16s32sX (a) | sum16s32s_C | ippsSum_16s32s_Sfs | sum16s32s_vec |
| ? | ors_c | ippsOr_32u | ? |
| ? | ands_c | ippsAnd_32u | ? |
| sigmoidXf (a) | sigmoidf_C | ? | sigmoidf_vec |
| PReluXf (a) | PReluf_C | ? | PReluf_vec |
| softmaxXf (a) | softmaxf_C | ? | softmaxf_vec |
| pol2cart2DXf (a) | pol2cart2Df_C | ? | pol2cart2Df_vec |
| cart2pol2DXf (a) | cart2pol2Df_C | ? | cart2pol2Df_vec |
| gatheri_256/512s | gatheri_C | ? | ? |
| fp32tofp16128/256 | fp32tofp16_C | ? | ? |
| fp16tofp32128/256 | fp16tofp32_C | ? | ? |
| ? | floodFill_4C_8u | ippiFloodFill_4Con_8u_C1IR | ? |
| ? | floodFill_4C_32s | ippiFloodFill_4Con_32s_C1IR | ? |
| ? | floodFill_8C_8u | ippiFloodFill_8Con_8u_C1IR | ? |
| ? | floodFill_8C_32s | ippiFloodFill_8Con_32s_C1IR | ? |
| ? | floodFill_8C_32f | ippiFloodFill_8Con_32f_C1IR | ? |
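
To make the correspondence with the IPP column more concrete, here is a scalar sketch of two of the operations listed in the table, written from the documented behaviour of ippsThreshold_GT_32f and ippsVectorSlope_32f. The function names and argument order are hypothetical and are not the library's C reference functions.

```c
#include <assert.h>

/* Hypothetical scalar reference following ippsThreshold_GT_32f:
   values greater than `level` are replaced by `level`. */
static void threshold_gt_sketch(const float *src, float *dst, int len,
                                float level)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        dst[i] = src[i] > level ? level : src[i];
}

/* Hypothetical scalar reference following ippsVectorSlope_32f:
   dst[i] = offset + slope * i. */
static void vector_slope_sketch(float *dst, int len, float offset,
                                float slope)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        dst[i] = offset + slope * (float)i;
}
```

Scalar loops like these are useful as a baseline when checking the output of a SIMD variant on the same input.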

Licence

This library is released under the BSD licence so that everyone can freely use it in their projects, find bugs, propose new functions or enhance existing ones.