veclib

Vector library for porting SSE2 instructions to other architectures

Credit to yhcheng for first providing a PPC64 port of the Burrows-Wheeler Aligner, which helped me expand the set of ported instructions. The PPC64 Altivec implementations are provided in vec128int.h, which depends on veclib_types.h.

To use veclib in a project that uses SSE2 instructions, place vec128int.h, vec128intlib.c, vec128intlib.h, and veclib_types.h in the same directory as the C/C++ source file that calls the SSE2 instructions. Replace every #include <emmintrin.h> with #include "vec128intlib.h", then add vec128intlib.o to the makefile; the project should then compile. Not all SSE2 instructions have been ported, so you may get errors for the missing functions. Some instructions also require POWER8 or newer; these are marked in vec128intlib.c and vec128intlib.h.
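As an alternative to replacing the include outright, the header swap can be guarded so one source file builds on both x86 and PPC64. This is only a sketch: the macro HAVE_SSE2_API and the helper add_bytes16 are hypothetical names, and it assumes vec128intlib.h provides the _mm_* names used here, as the instructions above imply.

```c
#include <stdint.h>

/* Assumption: vec128intlib.h is on the include path when building for PPC64. */
#if defined(__powerpc64__)
#include "vec128intlib.h"
#define HAVE_SSE2_API 1
#elif defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_SSE2_API 1
#else
#define HAVE_SSE2_API 0
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for 16 bytes, wrapping on overflow. */
static void add_bytes16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if HAVE_SSE2_API
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
#else
    /* Scalar fallback when neither API is available. */
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
#endif
}
```

The calling code stays identical on both platforms; only the header that supplies the _mm_* names changes.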

The IBM POWER vector intrinsic functions library can be downloaded from the IBM developerWorks website.

Below are some integer SSE2 instructions I've managed to port to Altivec using veclib-1.0.4.

On big-endian platforms, only the shift left/right operations are reversed.

Vector Integer Operations

Load

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_load_si128 (__m128i const* mem_addr) | vec_load1q | Load 128 bits of integer data, aligned |
| _mm_loadu_si128 (__m128i const* mem_addr) | vec_loadu1q | Load 128 bits of integer data, unaligned |
| _mm_loadl_epi64 (__m128i const* mem_addr) | vec_loadlower1sd | Load 64-bits of integer data to lower part and zero upper part |
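To make the "lower part" wording concrete, here is a scalar model of the _mm_loadl_epi64 behaviour. The type vec128_model and the function name are hypothetical and are not part of veclib; byte 0 is taken to be the lowest byte, as on x86.

```c
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for a 128-bit integer vector; b[0] is the lowest byte. */
typedef struct { uint8_t b[16]; } vec128_model;

/* Models _mm_loadl_epi64: copy 8 bytes into the lower half, zero the upper half. */
static vec128_model loadl_epi64_model(const void *mem)
{
    vec128_model v;
    memcpy(v.b, mem, 8);
    memset(v.b + 8, 0, 8);
    return v;
}
```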

Set

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_setzero_si128 () | vec_zero1q | Set 128 integer bits to zero |
| _mm_set1_epi8 (char a) | vec_splat16sb | Splat 8-bit char to 16 8-bit chars |
| _mm_set1_epi16 (short a) | vec_splat8sh | Splat 16-bit short to 8 16-bit shorts |
| _mm_set1_epi32 (int a) | vec_splat4sw | Splat 32-bit int to 4 32-bit ints |
| _mm_set1_epi64 (__m64 a) | vec_splat2sd | Splat 64-bit long long to 2 64-bit long longs |
| _mm_set_epi8 (char a) | vec_set16sb | Set 16 8-bit chars |
| _mm_set_epi16 (short a) | vec_set8sh | Set 8 16-bit shorts |
| _mm_set_epi32 (int a) | vec_set4sw | Set 4 32-bit ints |
| _mm_set_epi64 (__m64 a) | vec_set2sd | Set 2 64-bit long longs |
| _mm_setr_epi8 (char a) | vec_setreverse16sb | Set 16 8-bit chars reversed |
| _mm_setr_epi16 (short a) | vec_setreverse8sh | Set 8 16-bit shorts reversed |
| _mm_setr_epi32 (int a) | vec_setreverse4sw | Set 4 32-bit ints reversed |
| _mm_setr_epi64 (__m64 a) | vec_setreverse2sd | Set 2 64-bit long longs reversed |
| _mm_movpi64_epi64 (__m64 a) | vec_Zerouppersd | Set lower 64 bits of integer data and zero upper part |
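The difference between the "set" and "set reversed" rows is only argument order. The scalar sketch below models the SSE2 semantics for the 32-bit case; vec4sw_model and both function names are hypothetical, and lane 0 means the lowest 32 bits of the vector.

```c
#include <stdint.h>

/* Scalar stand-in for 4 packed 32-bit ints; lane[0] is the lowest 32 bits. */
typedef struct { int32_t lane[4]; } vec4sw_model;

/* Models _mm_set_epi32(e3, e2, e1, e0): the LAST argument lands in lane 0. */
static vec4sw_model set_epi32_model(int32_t e3, int32_t e2, int32_t e1, int32_t e0)
{
    vec4sw_model v = { { e0, e1, e2, e3 } };
    return v;
}

/* Models _mm_setr_epi32: the FIRST argument lands in lane 0. */
static vec4sw_model setr_epi32_model(int32_t a0, int32_t a1, int32_t a2, int32_t a3)
{
    vec4sw_model v = { { a0, a1, a2, a3 } };
    return v;
}
```

Under this model, set_epi32_model(4, 3, 2, 1) and setr_epi32_model(1, 2, 3, 4) produce the same vector.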

Store

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_store_si128 (__m128i* mem_addr, __m128i a) | vec_store1q | Store 128-bit integer, aligned |
| _mm_storeu_si128 (__m128i* mem_addr, __m128i a) | vec_storeu1q | Store 128-bit integer, unaligned |
| _mm_storel_epi64 (__m128i* mem_addr, __m128i a) | vec_storelower1sdof2sd | Store lower 64-bit long long |

Insert

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| | vec_insert4sw (__m128i into, int from, const intlit2 element_from_right) | Insert 32-bit int |
| | vec_insert2sd (__m128i into, long long from, const intlit1 element_from_right) | Insert 64-bit long long |
| | vec_convert1swto1uq (int from) | Insert 32-bit int, zeroing upper |
| _mm_insert_epi16 (__m128i a, int i, int imm8) | vec_insert8sh | Insert 16-bit short into one of 8 16-bit shorts |
| | vec_insert16ub (__m128i v, int scalar, intlit4 element_from_right) | Insert 8-bit unsigned char into one of 16 bytes |
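For reference, the SSE2 side of the 16-bit insert can be modelled in scalar C as below. The names vec8sh_model and insert_epi16_model are hypothetical. The sketch follows the SSE2 definition, where only the low 3 bits of imm8 select the lane; it does not claim anything about how veclib numbers elements internally.

```c
#include <stdint.h>

/* Scalar stand-in for 8 packed 16-bit shorts; lane[0] is the lowest 16 bits. */
typedef struct { int16_t lane[8]; } vec8sh_model;

/* Models _mm_insert_epi16: replace the lane selected by imm8[2:0] with the
 * low 16 bits of i, leaving the other seven lanes unchanged. */
static vec8sh_model insert_epi16_model(vec8sh_model a, int i, int imm8)
{
    a.lane[imm8 & 7] = (int16_t)(uint16_t)i;
    return a;
}
```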

Extract

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| | vec_extract1swfrom4sw (__m128i from, const intlit2 element_from_right) | Extract 32-bit int |
| | vec_extract1sdfrom2sd (__m128i from, const intlit1 element_from_right) | Extract 64-bit long long |
| _mm_extract_epi16 (__m128i a, int imm8) | vec_extract8sh | Extract 16-bit short from one of 8 16-bit shorts |
| _mm_movemask_epi8 (__m128i a) | vec_extractupperbit16sb | Extract upper bit of 16 8-bit chars |
| | vec_extractupperbit2dp (__m128d v) | Extract upper bit of 2 64-bit doubles |
| | vec_extractlowersw (__m128i from) | Extract lower 32-bit int |
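"Extract upper bit" is terse, so here is a scalar model of the two SSE2 extract operations in the table. Function names are hypothetical; element 0 is the lowest lane, following the SSE2 definition.

```c
#include <stdint.h>

/* Models _mm_movemask_epi8: bit i of the result is the sign (top) bit of byte i. */
static int movemask_epi8_model(const int8_t v[16])
{
    int mask = 0;
    for (int i = 0; i < 16; i++)
        if (v[i] < 0)
            mask |= 1 << i;
    return mask;
}

/* Models _mm_extract_epi16: return the lane selected by imm8[2:0],
 * zero-extended to int (so a lane holding -1 reads back as 65535). */
static int extract_epi16_model(const int16_t v[8], int imm8)
{
    return (uint16_t)v[imm8 & 7];
}
```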

Convert integer to integer

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_packs_epi16 (__m128i a, __m128i b) | vec_packs8hto16sb | Convert 8+8 16-bit shorts to 16 8-bit chars with signed saturation |
| _mm_packs_epi32 (__m128i a, __m128i b) | vec_packs4wto8sh | Convert 4+4 32-bit ints to 8 16-bit shorts with signed saturation |
| _mm_packus_epi16 (__m128i a, __m128i b) | | Convert packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst |
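The difference between signed and unsigned saturation in the pack operations shows up most clearly on out-of-range inputs. The following is a scalar sketch of the SSE2 semantics with hypothetical function names; output lanes 0..7 come from a and lanes 8..15 from b.

```c
#include <stdint.h>

/* Clamp a 16-bit value to the signed 8-bit range [-128, 127]. */
static int8_t saturate_s8(int16_t x)
{
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Clamp a 16-bit value to the unsigned 8-bit range [0, 255]. */
static uint8_t saturate_u8(int16_t x)
{
    if (x > UINT8_MAX) return UINT8_MAX;
    if (x < 0) return 0;
    return (uint8_t)x;
}

/* Models _mm_packs_epi16 (signed saturation). */
static void packs_epi16_model(const int16_t a[8], const int16_t b[8], int8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i] = saturate_s8(a[i]);
        out[i + 8] = saturate_s8(b[i]);
    }
}

/* Models _mm_packus_epi16 (unsigned saturation of signed inputs). */
static void packus_epi16_model(const int16_t a[8], const int16_t b[8], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i] = saturate_u8(a[i]);
        out[i + 8] = saturate_u8(b[i]);
    }
}
```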

Convert floating-point to integer

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_cvttps_epi32 (__m128 a) | vec_converttruncating4spto4sw | Convert 4 32-bit floats to 4 32-bit ints with truncation |
| _mm_cvtps_epi32 (__m128 a) | vec_convert4spto4sw | Convert 4 32-bit floats to 4 32-bit ints |
| _mm_cvttpd_epi32 (__m128d a) | vec_Convert2dpto2sw | Convert 2 64-bit doubles to 2 32-bit ints with truncation |
| _mm_cvtpd_epi32 (__m128d a) | | Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst |
| _mm_cvtsi32_si128 (int a) | | Copy 32-bit integer a to the lower element of dst, and zero the upper element |
| _mm_cvtsi64_si128 (int64 a) | | Copy 64-bit integer a to the lower element of dst, and zero the upper element |

Arithmetic

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_add_epi8 (__m128i a, __m128i b) | vec_add16sb | Add 16 8-bit chars |
| _mm_add_epi16 (__m128i a, __m128i b) | vec_add8sh | Add 8 16-bit shorts |
| _mm_add_epi32 (__m128i a, __m128i b) | vec_add4sw | Add 4 32-bit ints |
| _mm_add_epi64 (__m128i a, __m128i b) | vec_add2sd | Add 2 64-bit long longs |
| _mm_adds_epi8 (__m128i a, __m128i b) | vec_addsaturating16sb | Add 16 8-bit chars with signed saturation |
| _mm_adds_epu8 (__m128i a, __m128i b) | vec_addsaturating16ub | Add 16 8-bit chars with unsigned saturation |
| _mm_adds_epi16 (__m128i a, __m128i b) | vec_addsaturating8sh | Add 8 16-bit shorts with signed saturation |
| _mm_adds_epu16 (__m128i a, __m128i b) | vec_addsaturating8uh | Add 8 16-bit shorts with unsigned saturation |
| _mm_sub_epi8 (__m128i a, __m128i b) | vec_subtract16sb | Subtract 16 8-bit chars |
| _mm_sub_epi16 (__m128i a, __m128i b) | vec_subtract8sh | Subtract 8 16-bit shorts |
| _mm_sub_epi32 (__m128i a, __m128i b) | vec_subtract4sw | Subtract 4 32-bit ints |
| _mm_sub_epi64 (__m128i a, __m128i b) | vec_subtract2sd | Subtract 2 64-bit long longs |
| _mm_subs_epi8 (__m128i a, __m128i b) | vec_subtractsaturating16sb | Subtract 16 8-bit chars with signed saturation |
| _mm_subs_epu8 (__m128i a, __m128i b) | vec_subtractsaturating16ub | Subtract 16 8-bit chars with unsigned saturation |
| _mm_subs_epi16 (__m128i a, __m128i b) | vec_subtractsaturating8sh | Subtract 8 16-bit shorts with signed saturation |
| _mm_subs_epu16 (__m128i a, __m128i b) | vec_subtractsaturating8uh | Subtract 8 16-bit shorts with unsigned saturation |
| _mm_mul_epu32 (__m128i a, __m128i b) | vec_multiplylower2uwto2ud | Multiply lower 32-bit unsigned ints producing 2 64-bit unsigned long longs |
| _mm_madd_epi16 (__m128i a, __m128i b) | vec_multiply8sh | Multiply 8 16-bit signed shorts |
| _mm_mulhi_epi16 (__m128i a, __m128i b) | | Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst |
| _mm_mulhi_epu16 (__m128i a, __m128i b) | | Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst |
| _mm_mullo_epi16 (__m128i a, __m128i b) | | Multiply the packed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the low 16 bits of the intermediate integers in dst |
| _mm_avg_epu8 (__m128i a, __m128i b) | vec_average16ub | Average 16 8-bit unsigned chars |
| _mm_avg_epu16 (__m128i a, __m128i b) | vec_average8uh | Average 8 16-bit unsigned shorts |
| _mm_max_epi16 (__m128i a, __m128i b) | vec_max8sh | Max 8 16-bit shorts |
| _mm_max_epu8 (__m128i a, __m128i b) | vec_max16ub | Max 16 8-bit unsigned chars |
| _mm_min_epu8 (__m128i a, __m128i b) | vec_min16ub | Min 16 8-bit unsigned chars |
| _mm_min_epi16 (__m128i a, __m128i b) | vec_min8sh | Min 8 16-bit shorts |
| | vec_sumabsdiffs16ub (__m128i left, __m128i right) | Sum 2 octets of absolute differences of 16 8-bit unsigned chars into 2 64-bit long longs |
| | vec_summultiply4sh (__m128i left, __m128i right) | Multiply 4 16-bit shorts then add adjacent pairs with saturation to 4 32-bit ints |
| | vec_Abs16sb (__m128i a) | Absolute value 16 8-bit chars |
| | vec_Abs8sh (__m128i a) | Absolute value 8 16-bit shorts |
| | vec_Abs4sw (__m128i a) | Absolute value 4 32-bit ints |
| | vec_horizontalAddsaturating8sh (__m128i left, __m128i right) | Horizontally add 4+4 adjacent pairs of 16-bit shorts with saturation - (a0+a1, a2+a3, a4+a5, a6+a7, b0+b1, b2+b3, b4+b5, b6+b7) |
| | vec_horizontalSub8sh (__m128i left, __m128i right) | Horizontally subtract 4+4 adjacent pairs of 16-bit shorts to 8 16-bit shorts - (a0-a1, a2-a3, a4-a5, a6-a7, b0-b1, b2-b3, b4-b5, b6-b7) |
| | vec_partialhorizontalsubtract2sw (__m128i left, __m128i right) | Horizontally subtract 2+2 adjacent pairs of 32-bit ints to 4 32-bit ints |
| | vec_horizontalSubtractsaturating8sh (__m128i left, __m128i right) | Horizontally subtract 4+4 adjacent pairs of 16-bit shorts to 8 16-bit shorts with saturation - (a0-a1, a2-a3, a4-a5, a6-a7, b0-b1, b2-b3, b4-b5, b6-b7) |
| | vec_Multiply16sbthenhorizontalAddsaturating8sh (__m128i left, __m128i right) | Multiply 16 8-bit u*s chars then add adjacent 16-bit products with signed saturation |
| | vec_Multiply8shExtractUpper (__m128i left, __m128i right) | Multiply 8 16-bit shorts, shift right 14, add 1 and shift right 1 to 8 16-bit shorts |
| | vec_conditionalNegate16sb (__m128i left, __m128i right) | Negate 16 8-bit chars when mask is negative, zero when zero, else copy |
| | vec_conditionalNegate8sh (__m128i left, __m128i right) | Negate 8 16-bit shorts when mask is negative, zero when zero, else copy |
| | vec_conditionalNegate4sw (__m128i left, __m128i right) | Negate 4 32-bit ints when mask is negative, zero when zero, else copy |
| | vec_multiply4sw (__m128i left, __m128i right) | Multiply 4 32-bit signed ints |
| | vec_Max4sw (__m128i left, __m128i right) | Max 4 32-bit signed ints |
| | vec_Min4sw (__m128i left, __m128i right) | Min 4 32-bit signed ints |
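Several of the arithmetic rows differ only in how one lane handles overflow or rounding. The scalar sketch below models a single lane of a few SSE2 operations from the table; all function names are hypothetical, and the final narrowing casts rely on the usual two's-complement conversion that mainstream compilers provide.

```c
#include <stdint.h>

/* One lane of _mm_adds_epi16: signed add clamped to [-32768, 32767]. */
static int16_t adds_epi16_lane(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* One lane of _mm_subs_epu8: unsigned subtract clamped at 0. */
static uint8_t subs_epu8_lane(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* One lane of _mm_avg_epu8: average rounded up, (a + b + 1) >> 1. */
static uint8_t avg_epu8_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* One lane of _mm_mulhi_epi16: high 16 bits of the signed 32-bit product. */
static int16_t mulhi_epi16_lane(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    return (int16_t)(uint16_t)(p >> 16);
}

/* One lane of _mm_mullo_epi16: low 16 bits of the same product. */
static int16_t mullo_epi16_lane(int16_t a, int16_t b)
{
    uint32_t p = (uint32_t)((int32_t)a * (int32_t)b);
    return (int16_t)(uint16_t)(p & 0xFFFFu);
}
```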

Boolean

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_and_si128 (__m128i a, __m128i b) | vec_bitand1q | Bitwise 128-bit and |
| _mm_andnot_si128 (__m128i a, __m128i b) | vec_bitandnotleft1q | Bitwise 128-bit and not (reversed) |
| _mm_or_si128 (__m128i a, __m128i b) | vec_bitor1q | Bitwise 128-bit or |
| _mm_xor_si128 (__m128i a, __m128i b) | vec_bitxor1q | Bitwise 128-bit xor |
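The and-not operation is easy to get backwards: in SSE2 it is the first (left) operand that is complemented. Here is a scalar model using a hypothetical two-word struct for the 128 bits.

```c
#include <stdint.h>

/* Scalar stand-in for 128 bits as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } vec128_bits;

/* Models _mm_andnot_si128(a, b): computes (~a) & b, not a & (~b). */
static vec128_bits andnot_si128_model(vec128_bits a, vec128_bits b)
{
    vec128_bits r = { ~a.lo & b.lo, ~a.hi & b.hi };
    return r;
}
```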

Unpack

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_unpackhi_epi8 (__m128i a, __m128i b) | vec_unpackhigh8sb | Unpack 8+8 8-bit chars from high halves and interleave |
| _mm_unpackhi_epi16 (__m128i a, __m128i b) | vec_unpackhigh4sh | Unpack 4+4 16-bit shorts from high halves and interleave |
| _mm_unpacklo_epi8 (__m128i a, __m128i b) | vec_unpacklow8sb | Unpack 8+8 8-bit chars from low halves and interleave |
| _mm_unpacklo_epi16 (__m128i a, __m128i b) | vec_unpacklow4sh | Unpack 4+4 16-bit shorts from low halves and interleave |
| _mm_unpacklo_epi32 (__m128i a, __m128i b) | vec_unpacklow2sw | Unpack 2+2 32-bit ints from low halves and interleave |
| _mm_unpackhi_epi32 (__m128i a, __m128i b) | vec_unpackhigh2sw | Unpack 2+2 32-bit ints from high halves and interleave |
| _mm_unpacklo_epi64 (__m128i a, __m128i b) | vec_unpacklow1sd | Unpack 1+1 64-bit long longs from low halves and interleave |
| _mm_unpackhi_epi64 (__m128i a, __m128i b) | vec_unpackhigh1sd | Unpack 1+1 64-bit long longs from high halves and interleave |
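"Unpack and interleave" means alternating elements from a and b, starting with a. The scalar model below covers the 8-bit variants under the SSE2 definition; the function names are hypothetical and index 0 is the lowest byte.

```c
#include <stdint.h>

/* Models _mm_unpacklo_epi8: interleave the low 8 bytes of a and b
 * as a0, b0, a1, b1, ..., a7, b7. */
static void unpacklo_epi8_model(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Models _mm_unpackhi_epi8: same pattern using the high 8 bytes,
 * a8, b8, a9, b9, ..., a15, b15. */
static void unpackhi_epi8_model(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i] = a[8 + i];
        out[2 * i + 1] = b[8 + i];
    }
}
```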

Shift

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_sll_epi16 (__m128i a, __m128i count) | vec_shiftleft8sh | Shift 8 16-bit shorts left logical |
| _mm_sll_epi32 (__m128i a, __m128i count) | vec_shiftleft4sw | Shift 4 32-bit ints left logical |
| _mm_sll_epi64 (__m128i a, __m128i count) | vec_shiftleft2sd | Shift 2 64-bit long longs left logical |
| _mm_slli_epi16 (__m128i a, int imm8) | vec_shiftleftimmediate8sh | Shift 8 16-bit shorts left logical immediate |
| _mm_slli_epi32 (__m128i a, int imm8) | vec_shiftleftimmediate4sw | Shift 4 32-bit ints left logical immediate |
| _mm_slli_epi64 (__m128i a, int imm8) | vec_shiftleftimmediate2sd | Shift 2 64-bit long longs left logical immediate |
| _mm_slli_si128 (__m128i a, int imm8) | vec_shiftleftbytes1q | Shift 128 bits left logical immediate by bytes |
| _mm_srl_epi16 (__m128i a, __m128i count) | vec_shiftright8sh | Shift 8 16-bit shorts right logical |
| _mm_srl_epi32 (__m128i a, __m128i count) | vec_shiftright4sw | Shift 4 32-bit ints right logical |
| _mm_srl_epi64 (__m128i a, __m128i count) | vec_shiftright2sd | Shift 2 64-bit long longs right logical |
| _mm_srli_epi16 (__m128i a, int imm8) | vec_shiftrightimmediate8sh | Shift 8 16-bit shorts right logical immediate |
| _mm_srli_epi32 (__m128i a, int imm8) | vec_shiftrightimmediate4sw | Shift 4 32-bit ints right logical immediate |
| _mm_srli_epi64 (__m128i a, int imm8) | | Shift 2 64-bit long longs right logical immediate |
| _mm_srli_si128 (__m128i a, int imm8) | vec_shiftrightbytes1q | Shift 128 bits right logical immediate by bytes |
| | vec_shiftrightarithmetic4wimmediate (__m128i v, intlit8 count) | Shift 4 32-bit ints right arithmetic |
| | vec_shiftrightlogical2dimmediate (__m128i v, intlit8 count) | Shift 2 64-bit long longs right logical immediate |
| | vec_shiftrightarithmetic8himmediate (__m128i v, intlit8 count) | Shift 8 16-bit shorts right arithmetic |
| | vec_shiftrightarithmetic4sw (__m128i v, intlit8 count) | Shift 4 32-bit ints right arithmetic |
| | vec_shiftright2dqw (__m128i left, __m128i right, int const count) | Shift 128+128 bits right into 128 bits |
| _mm_bslli_si128 (__m128i a, int imm8) | | Shift a left by imm8 bytes while shifting in zeros, and store the results in dst |
| _mm_bsrli_si128 (__m128i a, int imm8) | | Shift a right by imm8 bytes while shifting in zeros, and store the results in dst |
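The whole-register byte shifts are the ones where byte order matters most, so a scalar model of the SSE2 semantics may help when checking a port. Function names are hypothetical; index 0 is the least significant byte, and the input and output arrays must not overlap.

```c
#include <stdint.h>

/* Models _mm_slli_si128 / _mm_bslli_si128: shift the 128-bit value left by
 * imm8 bytes, filling with zeros. Counts above 15 yield all zeros. */
static void slli_si128_model(const uint8_t a[16], int imm8, uint8_t out[16])
{
    unsigned n = (unsigned)imm8 & 0xFFu;
    if (n > 15) n = 16;
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i >= n) ? a[i - n] : 0;
}

/* Models _mm_srli_si128 / _mm_bsrli_si128: shift right by imm8 bytes. */
static void srli_si128_model(const uint8_t a[16], int imm8, uint8_t out[16])
{
    unsigned n = (unsigned)imm8 & 0xFFu;
    if (n > 15) n = 16;
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i + n < 16) ? a[i + n] : 0;
}
```

Comparing a ported shift against a model like this on both little-endian and big-endian targets is one way to check the reversal mentioned earlier.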

Permute

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_shufflehi_epi16 (__m128i a, int imm8) | | Shuffle 16-bit integers in the high 64 bits of a using the control in imm8. Store the results in the high 64 bits of dst, with the low 64 bits being copied from a to dst |
| _mm_shufflelo_epi16 (__m128i a, int imm8) | vec_permutelower4sh | Shuffle lower 4 16-bit shorts using mask, leaving upper half unchanged |
| _mm_shuffle_epi32 (__m128i a, int imm8) | vec_permute4sw | Shuffle 4 32-bit ints using mask |
| | vec_permute16sb (__m128i v, __m128i mask) | Shuffle 16 8-bit chars using mask |
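The shuffle mask is a packed field of 2-bit selectors, one per output lane. This scalar sketch decodes it as the SSE2 definition of _mm_shuffle_epi32 describes; the function name is hypothetical, lane 0 is the lowest 32 bits, and the input and output arrays must not overlap.

```c
#include <stdint.h>

/* Models _mm_shuffle_epi32: output lane i takes the input lane selected by
 * bits [2*i+1 : 2*i] of imm8. */
static void shuffle_epi32_model(const int32_t a[4], int imm8, int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = a[(imm8 >> (2 * i)) & 3];
}
```

For example, imm8 = 0x1B (binary 00 01 10 11) reverses the four lanes, and imm8 = 0x00 broadcasts lane 0.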

Compare

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_cmpeq_epi8 (__m128i a, __m128i b) | vec_compareeq16sb | Compare 16 8-bit chars for == to vector mask |
| _mm_cmpeq_epi16 (__m128i a, __m128i b) | vec_compareeq8sh | Compare 8 16-bit shorts for == to vector mask |
| _mm_cmpeq_epi32 (__m128i a, __m128i b) | vec_compareeq4sw | Compare 4 32-bit ints for == to vector mask |
| _mm_cmplt_epi8 (__m128i a, __m128i b) | vec_comparelt16sb | Compare 16 8-bit chars for < to vector mask |
| _mm_cmplt_epi16 (__m128i a, __m128i b) | vec_comparelt8sh | Compare 8 16-bit shorts for < to vector mask |
| _mm_cmplt_epi32 (__m128i a, __m128i b) | vec_comparelt4sw | Compare 4 32-bit ints for < to vector mask |
| _mm_cmpgt_epi8 (__m128i a, __m128i b) | vec_comparegt16sb | Compare 16 8-bit chars for > to vector mask |
| _mm_cmpgt_epi16 (__m128i a, __m128i b) | vec_comparegt8sh | Compare 8 16-bit shorts for > to vector mask |
| _mm_cmpgt_epi32 (__m128i a, __m128i b) | vec_comparegt4sw | Compare 4 32-bit ints for > to vector mask |
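"To vector mask" means each result lane is all ones when the comparison holds and all zeros otherwise. Below is a scalar model of the 16-bit signed comparisons with hypothetical function names; it also shows that less-than is simply greater-than with the operands swapped.

```c
#include <stdint.h>

/* Models _mm_cmpgt_epi16: lane i is 0xFFFF if a[i] > b[i] (signed), else 0. */
static void cmpgt_epi16_model(const int16_t a[8], const int16_t b[8], uint16_t mask[8])
{
    for (int i = 0; i < 8; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFFu : 0u;
}

/* Models _mm_cmplt_epi16 as cmpgt with the operands swapped. */
static void cmplt_epi16_model(const int16_t a[8], const int16_t b[8], uint16_t mask[8])
{
    cmpgt_epi16_model(b, a, mask);
}
```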

Cast

| SSE2 | Altivec (veclib) | Description |
|---|---|---|
| _mm_castps_si128 (__m128 a) | vec_cast4spto1q | Cast __m128 to __m128i |
| _mm_castpd_si128 (__m128d a) | vec_Cast2dpto4sw | Cast __m128d to __m128i |
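Unlike the conversion functions, a cast changes only the type and leaves the bits untouched. The scalar sketch below models one 32-bit lane of _mm_castps_si128; the helper name is hypothetical and the example assumes IEEE-754 single-precision floats.

```c
#include <stdint.h>
#include <string.h>

/* Models one lane of _mm_castps_si128: reinterpret the float's bits as an
 * integer without converting the value. memcpy avoids aliasing problems. */
static uint32_t cast_float_lane_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Under this model, 1.0f becomes the bit pattern 0x3F800000, whereas a value conversion such as _mm_cvtps_epi32 would give the integer 1.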