VectorTraits

English | Chinese(中文)

VectorTraits: SIMD Vector type traits methods (SIMD向量类型的特征方法).

This library provides many important arithmetic methods (e.g. Shift, Shuffle, NarrowSaturate) and constants for vector types, making it easier to write cross-platform SIMD code. It takes full advantage of the intrinsic functions of the X86 and Arm architectures to achieve hardware acceleration, and its methods benefit from inline compilation optimization.

Commonly Used Types:

Traits methods:

Supported instruction set:

Purpose

SIMD instruction sets are known to accelerate multimedia processing (graphics, images, audio, video, ...), artificial intelligence, scientific computing, and more. However, traditional SIMD programming suffers from the following pain points.

.NET Core 1.0, released in 2016, added vector types such as Vector<T>, which largely solve the above pain points.

Although Vector<T> is well designed, it lacks many important vector functions such as Ceiling, Sum, Shift, and Shuffle, which makes many algorithms difficult to implement with vector types. New .NET versions sometimes add vector methods; .NET 7.0 (released in 2022), for example, added ShiftRightArithmetic, Shuffle, and others. Even so, the set of vector methods remains small; saturation processing, for example, is still missing. To address this shortage, .NET Core 3.0 began to support intrinsic functions, which let developers use SIMD instructions directly, but they bring back the difficulties of cross-platform support and bit-width upgrades. More intrinsic functions are added as the .NET platform evolves; .NET 5.0, for example, added intrinsic functions for the Arm platform. A library cannot support only .NET 7.0; it has to support multiple .NET versions, which means tedious version checking and conditional processing. The highest version of .NET Standard (2.1) still does not support vector methods such as Ceiling, which makes version checking even more tedious.
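To make the version checking concrete, here is a minimal sketch of the conditional compilation that a multi-targeting library would otherwise write by hand. The class name `ShiftLeftCompat` is hypothetical, and the scalar fallback is only illustrative; it is not how this library implements ShiftLeft. `NET7_0_OR_GREATER` is the standard preprocessor symbol defined by the .NET SDK.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
public static class ShiftLeftCompat {
    // Shifts each element left by shiftAmount (expected range: 0 to 15).
    public static Vector<short> ShiftLeft(Vector<short> value, int shiftAmount) {
#if NET7_0_OR_GREATER
        // .NET 7.0 and later: the BCL provides the vector method.
        return Vector.ShiftLeft(value, shiftAmount);
#else
        // Older targets: fall back to a plain scalar loop.
        short[] tmp = new short[Vector<short>.Count];
        value.CopyTo(tmp);
        for (int i = 0; i < tmp.Length; i++) {
            tmp[i] = (short)(tmp[i] << shiftAmount);
        }
        return new Vector<short>(tmp);
#endif
    }

    public static void Main() {
        Vector<short> src = new Vector<short>(3);
        Console.WriteLine(ShiftLeft(src, 1)); // every element is 3 << 1
    }
}
```

Every additional method, element type, and target framework multiplies branches like this one.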

This library aims to solve these problems so that you can write cross-platform SIMD algorithms more easily. Features:

Tip: The Disassembly window in Visual Studio lets you view the assembly code at run time. For example, on a machine that supports the Avx instruction set, Vectors.ShiftLeft_Const is compiled inline and optimized to use the vpsllw instruction, and the constant value (1) is encoded as the immediate operand of that instruction. Vectors.ShiftLeft_use_inline.png

Example 2: With Vectors.ShiftLeft_Args and Vectors.ShiftLeft_Core, you can move some of the operations outside the loop so that they are processed earlier. For example, on a machine that supports the Avx instruction set, xmm1 is set outside the loop and then used by the vpsllw instruction in the inner loop. The figure also shows that inline compilation optimization eliminates redundant xmm/ymm conversions. Vectors.ShiftLeft_Core_use_inline.png

Getting started

1) Install via NuGet

Either open the 'Package Manager Console' and enter the following command, or use the built-in GUI.

NuGet: PM> Install-Package VectorTraits

2) Usage examples

The static class Vectors provides methods such as CreateRotate, ShiftLeft, and Shuffle. The generic structure 'Vectors<T>' provides fields for commonly used constants.

The example code is in the samples/VectorTraits.Sample folder. The source code is as follows.

using System;
using System.IO;
using System.Numerics;
#if NETCOREAPP3_0_OR_GREATER
using System.Runtime.Intrinsics;
#endif
using Zyl.VectorTraits;

namespace Zyl.VectorTraits.Sample {
    class Program {
        private static readonly TextWriter writer = Console.Out;
        static void Main(string[] args) {
            writer.WriteLine("VectorTraits.Sample");
            writer.WriteLine();
            VectorTraitsGlobal.Init(); // Initialization.
            TraitsOutput.OutputEnvironment(writer); // Output environment info. It depends on `VectorTraits.InfoInc`. This line can be deleted when only VectorTraits is used.
            writer.WriteLine();

            // -- Start --
            Vector<short> src = Vectors.CreateRotate<short>(0, 1, 2, 3, 4, 5, 6, 7); // The `Vectors` class provides some methods. For example, `CreateRotate` fills the vector by rotating through the given values.
            VectorTextUtil.WriteLine(writer, "src:\t{0}", src); // Besides formatting the string, it also shows the hexadecimal value of each element on the right, which makes vector data easy to inspect.

            // ShiftLeft. It is a new vector method in `.NET 7.0`
            const int shiftAmount = 1;
            Vector<short> shifted = Vectors.ShiftLeft(src, shiftAmount); // shifted[i] = src[i] << shiftAmount.
            VectorTextUtil.WriteLine(writer, "ShiftLeft:\t{0}", shifted);
#if NET7_0_OR_GREATER
            // Compare with the BCL function.
            Vector<short> shiftedBCL = Vector.ShiftLeft(src, shiftAmount);
            VectorTextUtil.WriteLine(writer, "Equals to BCL ShiftLeft:\t{0}", shifted.Equals(shiftedBCL));
#endif
            // ShiftLeft_Const
            VectorTextUtil.WriteLine(writer, "Equals to ShiftLeft_Const:\t{0}", shifted.Equals(Vectors.ShiftLeft_Const(src, shiftAmount))); // If shiftAmount is a constant, you can also use the ShiftLeft_Const method of Vectors. It is faster in many scenarios.
            writer.WriteLine();

            // Shuffle. It is a new vector method in `.NET 7.0`
            Vector<short> desc = Vectors<short>.SerialDesc; // The generic structure 'Vectors<T>' provides fields for commonly used constants. For example, 'SerialDesc' holds serial values in descending order.
            VectorTextUtil.WriteLine(writer, "desc:\t{0}", desc);
            Vector<short> dst = Vectors.Shuffle(shifted, desc); // dst[i] = shifted[desc[i]].
            VectorTextUtil.WriteLine(writer, "Shuffle:\t{0}", dst);
#if NET7_0_OR_GREATER
            // Compare with the BCL function.
            Vector<short> dstBCL = default; // Since `.NET 7.0`, Vector128/Vector256 provide a Shuffle method, but Vector does not yet.
            if (Vector<short>.Count == Vector128<short>.Count) {
                dstBCL = Vector128.Shuffle(shifted.AsVector128(), desc.AsVector128()).AsVector();
            } else if (Vector<short>.Count == Vector256<short>.Count) {
                dstBCL = Vector256.Shuffle(shifted.AsVector256(), desc.AsVector256()).AsVector();
            }
            VectorTextUtil.WriteLine(writer, "Equals to BCL Shuffle:\t{0}", dst.Equals(dstBCL));
#endif
            // Shuffle_Args and Shuffle_Core
            Vectors.Shuffle_Args(desc, out var args0, out var args1); // Methods with the `Args` suffix compute the parameters in advance (e.g. parameter transformation). Call them outside the loop.
            Vector<short> dst2 = Vectors.Shuffle_Core(shifted, args0, args1); // Methods with the `Core` suffix perform the core calculation from the cached parameters. Call them inside the loop to improve performance.
            VectorTextUtil.WriteLine(writer, "Equals to Shuffle_Core:\t{0}", dst.Equals(dst2));
            writer.WriteLine();

            // Show AcceleratedTypes.
            VectorTextUtil.WriteLine(writer, "ShiftLeft_AcceleratedTypes:\t{0}", Vectors.ShiftLeft_AcceleratedTypes);
            VectorTextUtil.WriteLine(writer, "Shuffle_AcceleratedTypes:\t{0}", Vectors.Shuffle_AcceleratedTypes);
        }
    }
}

3) Example results

.NET8.0 on X86

Program: VectorTraits.Sample

VectorTraits.Sample

IsRelease:	True
Environment.ProcessorCount:	16
Environment.Is64BitProcess:	True
Environment.OSVersion:	Microsoft Windows NT 10.0.22631.0
Environment.Version:	8.0.8
Stopwatch.Frequency:	10000000
RuntimeEnvironment.GetRuntimeDirectory:	C:\Program Files\dotnet\shared\Microsoft.NETCore.App\8.0.8\
RuntimeInformation.FrameworkDescription:	.NET 8.0.8
RuntimeInformation.OSArchitecture:	X64
RuntimeInformation.OSDescription:	Microsoft Windows 10.0.22631
RuntimeInformation.RuntimeIdentifier:	win-x64
IntPtr.Size:	8
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	True
Vector<byte>.Count:	32	# 256bit
Vector<float>.Count:	8	# 256bit
Vector128.IsHardwareAccelerated:	True
Vector256.IsHardwareAccelerated:	True
Vector512.IsHardwareAccelerated:	True
Vector<T>.Assembly.CodeBase:	file:///C:/Program Files/dotnet/shared/Microsoft.NETCore.App/8.0.8/System.Private.CoreLib.dll
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET 8.0
GetTargetFrameworkDisplayName(TraitsOutput):	.NET 8.0
VectorTraitsGlobal.InitCheckSum:	-2122844161	# 0x8177F7FF
VectorEnvironment.CpuModelName:	AMD Ryzen 7 7840H w/ Radeon 780M Graphics
VectorEnvironment.SupportedInstructionSets:	Aes, Avx, Avx2, Avx512BW, Avx512CD, Avx512DQ, Avx512F, Avx512Vbmi, Avx512VL, Bmi1, Bmi2, Fma, Lzcnt, Pclmulqdq, Popcnt, Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, X86Base
Vector128s.Instance:	WVectorTraits128Avx2	// Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx, Avx2, Avx512VL
Vector256s.Instance:	WVectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL
Vector512s.Instance:	WVectorTraits512Avx512	// Avx512BW, Avx512DQ, Avx512F, Avx512Vbmi, Avx, Avx2, Sse, Sse2
Vectors.Instance:	VectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL
Vectors.BaseInstance:	VectorTraits256Base

src:    <0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7>        # (0000 0001 0002 0003 0004 0005 0006 0007 0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:      <0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14>  # (0000 0002 0004 0006 0008 000A 000C 000E 0000 0002 0004 0006 0008 000A 000C 000E)
Equals to BCL ShiftLeft:        True
Equals to ShiftLeft_Const:      True

desc:   <15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0>  # (000F 000E 000D 000C 000B 000A 0009 0008 0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:        <14, 12, 10, 8, 6, 4, 2, 0, 14, 12, 10, 8, 6, 4, 2, 0>  # (000E 000C 000A 0008 0006 0004 0002 0000 000E 000C 000A 0008 0006 0004 0002 0000)
Equals to BCL Shuffle:  True
Equals to Shuffle_Core: True

ShiftLeft_AcceleratedTypes:     SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64        # (00001FE0)
Shuffle_AcceleratedTypes:       SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Single, Double        # (00007FE0)

Note: The text up to Vectors.BaseInstance is the environment information output by TraitsOutput.OutputEnvironment. The text starting from src is the output of the example's main code. Since the CPU supports the X86 Avx2 instruction set, Vector<byte>.Count is 32 (256bit) and Vectors.Instance is VectorTraits256Avx2.

.NET8.0 on Arm

Program: VectorTraits.Sample

VectorTraits.Sample

IsRelease:	True
Environment.ProcessorCount:	2
Environment.Is64BitProcess:	True
Environment.OSVersion:	Unix 6.8.0.1015
Environment.Version:	8.0.7
Stopwatch.Frequency:	1000000000
RuntimeEnvironment.GetRuntimeDirectory:	/home/ubuntu/.dotnet/shared/Microsoft.NETCore.App/8.0.7/
RuntimeInformation.FrameworkDescription:	.NET 8.0.7
RuntimeInformation.OSArchitecture:	Arm64
RuntimeInformation.OSDescription:	Ubuntu 22.04.2 LTS
RuntimeInformation.RuntimeIdentifier:	linux-arm64
IntPtr.Size:	8
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	True
Vector<byte>.Count:	16	# 128bit
Vector<float>.Count:	4	# 128bit
Vector128.IsHardwareAccelerated:	True
Vector256.IsHardwareAccelerated:	False
Vector512.IsHardwareAccelerated:	False
Vector<T>.Assembly.CodeBase:	file:///home/ubuntu/.dotnet/shared/Microsoft.NETCore.App/8.0.7/System.Private.CoreLib.dll
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET 8.0
GetTargetFrameworkDisplayName(TraitsOutput):	.NET 8.0
VectorTraitsGlobal.InitCheckSum:	-2122844159	# 0x8177F801
VectorEnvironment.CpuModelName:	Neoverse-N1
VectorEnvironment.CpuFlags:	fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
VectorEnvironment.SupportedInstructionSets:	AdvSimd, Aes, ArmBase, Crc32, Dp, Rdm, Sha1, Sha256
Vector128s.Instance:	WVectorTraits128AdvSimdB64	// AdvSimd
Vectors.Instance:	VectorTraits128AdvSimdB64	// AdvSimd
Vectors.BaseInstance:	VectorTraits128Base

src:	<0, 1, 2, 3, 4, 5, 6, 7>	# (0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:	<0, 2, 4, 6, 8, 10, 12, 14>	# (0000 0002 0004 0006 0008 000A 000C 000E)
Equals to BCL ShiftLeft:	True
Equals to ShiftLeft_Const:	True

desc:	<7, 6, 5, 4, 3, 2, 1, 0>	# (0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:	<14, 12, 10, 8, 6, 4, 2, 0>	# (000E 000C 000A 0008 0006 0004 0002 0000)
Equals to BCL Shuffle:	True
Equals to Shuffle_Core:	True

ShiftLeft_AcceleratedTypes:	SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64	# (00001FE0)
Shuffle_AcceleratedTypes:	SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Single, Double	# (00007FE0)

The computed results match the X86 ones; only the vector width and the environment information differ. Since the CPU supports Arm's AdvSimd instruction set, Vector<byte>.Count is 16 (128bit) and Vectors.Instance is VectorTraits128AdvSimdB64.

.NET Framework 4.5 on X86

Program: VectorTraits.Sample.NetFw.

VectorTraits.Sample

IsRelease:	True
Environment.ProcessorCount:	16
Environment.Is64BitProcess:	True
Environment.OSVersion:	Microsoft Windows NT 6.2.9200.0
Environment.Version:	4.0.30319.42000
Stopwatch.Frequency:	10000000
RuntimeEnvironment.GetRuntimeDirectory:	C:\Windows\Microsoft.NET\Framework64\v4.0.30319\
RuntimeInformation.FrameworkDescription:	.NET Framework 4.8.9277.0
RuntimeInformation.OSArchitecture:	X64
RuntimeInformation.OSDescription:	Microsoft Windows 10.0.22631 
IntPtr.Size:	8
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	True
Vector<byte>.Count:	32	# 256bit
Vector<float>.Count:	8	# 256bit
Vector<T>.Assembly.CodeBase:	file:///E:/zylSelf/Code/cs/base/VectorTraits/tests/VectorTraits.Benchmarks.NetFw/bin/Release/System.Numerics.Vectors.DLL
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET Standard 1.1
GetTargetFrameworkDisplayName(TraitsOutput):	.NET Framework 4.5
VectorTraitsGlobal.InitCheckSum:	-25396097	# 0xFE7C7C7F
VectorEnvironment.CpuModelName:	AMD Ryzen 7 7840H w/ Radeon 780M Graphics
VectorEnvironment.SupportedInstructionSets:	
Vectors.Instance:	VectorTraits256Base	// 
Vectors.BaseInstance:	VectorTraits256Base

src:    <0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7>        # (0000 0001 0002 0003 0004 0005 0006 0007 0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:      <0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14>  # (0000 0002 0004 0006 0008 000A 000C 000E 0000 0002 0004 0006 0008 000A 000C 000E)
Equals to ShiftLeft_Const:      True

desc:   <15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0>  # (000F 000E 000D 000C 000B 000A 0009 0008 0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:        <14, 12, 10, 8, 6, 4, 2, 0, 14, 12, 10, 8, 6, 4, 2, 0>  # (000E 000C 000A 0008 0006 0004 0002 0000 000E 000C 000A 0008 0006 0004 0002 0000)
Equals to Shuffle_Core: True

ShiftLeft_AcceleratedTypes:     SByte, Byte, Int16, UInt16, Int32, UInt32       # (000007E0)
Shuffle_AcceleratedTypes:       None    # (00000000)

ShiftLeft/Shuffle of Vectors works fine. Since the CPU supports the X86 Avx2 instruction set, Vector<byte>.Count is 32 (256bit). Vectors.Instance is VectorTraits256Base rather than VectorTraits256Avx2, because intrinsic functions were not supported until .NET Core 3.0. The value of ShiftLeft_AcceleratedTypes contains types such as "Int16", which means that ShiftLeft is hardware-accelerated for these types. The library makes clever use of vector algorithms to achieve hardware acceleration where possible, even without intrinsic functions.

Results of benchmark

Unit of data: Million operations per second. The larger the number, the better the performance.

ShiftLeft

ShiftLeft: Shifts each element of a vector left by the specified amount. It is a new vector method in .NET 7.0.

ShiftLeft - X86 - AMD Ryzen 7 7840H

| Type | Method | .NET Framework | .NET Core 2.1 | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Byte | SumSLLScalar | 1062.046 | 1025.936 | 1287.865 | 1265.446 | 1445.575 | 1416.712 | 1693.330 |
| Byte | SumSLLNetBcl | | | | | | 1344.738 | 1109.752 |
| Byte | SumSLLNetBcl_Const | | | | | | 1281.901 | 1164.382 |
| Byte | SumSLLTraits | 11312.499 | 10715.920 | 28897.868 | 28611.234 | 28219.205 | 34068.741 | 57456.802 |
| Byte | SumSLLTraits_Core | 55791.675 | 52165.732 | 53563.421 | 68653.359 | 59916.622 | 67868.291 | 74889.177 |
| Byte | SumSLLConstTraits | 13408.916 | 12604.412 | 38925.388 | 57842.081 | 57095.294 | 62012.692 | 62729.225 |
| Byte | SumSLLConstTraits_Core | 56843.523 | 55673.528 | 53642.484 | 62674.397 | 65797.708 | 50869.840 | 73873.979 |
| Int16 | SumSLLScalar | 1081.716 | 999.767 | 1261.475 | 1198.111 | 1218.767 | 1365.754 | 1547.294 |
| Int16 | SumSLLNetBcl | | | | | | 32011.646 | 34816.284 |
| Int16 | SumSLLNetBcl_Const | | | | | | 39975.924 | 37368.541 |
| Int16 | SumSLLTraits | 6752.349 | 6185.968 | 25221.856 | 26382.708 | 27125.955 | 32617.944 | 36448.716 |
| Int16 | SumSLLTraits_Core | 34727.283 | 31457.238 | 31800.310 | 32231.553 | 35687.996 | 37750.305 | 30731.745 |
| Int16 | SumSLLConstTraits | 6037.367 | 6498.819 | 27783.526 | 37605.559 | 40699.914 | 39598.663 | 36242.630 |
| Int16 | SumSLLConstTraits_Core | 37678.435 | 34784.616 | 32625.543 | 33694.338 | 40019.325 | 39380.404 | 36914.775 |
| Int32 | SumSLLScalar | 1369.140 | 1315.852 | 1514.690 | 1521.516 | 2284.670 | 2484.407 | 2409.358 |
| Int32 | SumSLLNetBcl | | | | | | 17373.567 | 15954.004 |
| Int32 | SumSLLNetBcl_Const | | | | | | 17967.080 | 15983.409 |
| Int32 | SumSLLTraits | 3762.374 | 3511.433 | 13343.304 | 12906.293 | 12661.423 | 17279.760 | 15886.410 |
| Int32 | SumSLLTraits_Core | 17324.275 | 15468.381 | 14587.937 | 17407.823 | 17886.651 | 18052.162 | 14126.571 |
| Int32 | SumSLLConstTraits | 3910.600 | 3724.412 | 12646.545 | 15290.340 | 17745.992 | 17829.078 | 15991.615 |
| Int32 | SumSLLConstTraits_Core | 16235.154 | 14216.598 | 15282.565 | 16088.400 | 17940.330 | 15961.166 | 16378.506 |
| Int64 | SumSLLScalar | 1394.719 | 1281.156 | 1517.938 | 1441.160 | 2270.521 | 2508.577 | 2421.558 |
| Int64 | SumSLLNetBcl | | | | | | 7528.184 | 8530.835 |
| Int64 | SumSLLNetBcl_Const | | | | | | 8743.504 | 8471.981 |
| Int64 | SumSLLTraits | 483.430 | 494.335 | 6677.544 | 6570.711 | 6635.070 | 6891.705 | 7469.236 |
| Int64 | SumSLLTraits_Core | 479.761 | 488.827 | 7758.515 | 8525.784 | 8596.290 | 8267.855 | 7879.060 |
| Int64 | SumSLLConstTraits | 509.585 | 525.195 | 7036.223 | 6787.101 | 8246.601 | 8254.880 | 8526.022 |
| Int64 | SumSLLConstTraits_Core | 512.652 | 528.381 | 8229.954 | 8747.125 | 8711.523 | 8871.948 | 8647.339 |

Description.

On the X86 platform, the BCL method (Vector.ShiftLeft) is hardware accelerated only for Int16/Int32/Int64, not for Byte. This is probably because the Avx2 instruction set only provides left shift instructions for 16 to 64 bit elements, so the BCL falls back to a software algorithm for the other types. For those types, this library substitutes efficient algorithms built from combinations of other instructions. For example, for the Byte type on .NET 8.0, SumSLLConstTraits_Core scores 73873.979, which is 73873.979/1693.330≈43.6264 times the performance of the scalar algorithm and 73873.979/1164.382≈63.4448 times the performance of the BCL method. X86 intrinsic functions have only been available since .NET Core 3.0, so for the Int64 type hardware acceleration is only available from .NET Core 3.0 onward.
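As an illustration of what "a combination of other instructions" can look like, the following sketch shifts bytes left using only 16-bit operations and a mask. It is one well-known technique, written with BCL calls only; the source does not say which algorithm this library actually uses, and the class name `ByteShiftSketch` is hypothetical.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration: byte-wise shift left built from 16-bit operations.
public static class ByteShiftSketch {
    public static Vector<byte> ShiftLeft(Vector<byte> value, int shiftAmount) {
        int n = shiftAmount & 7;
        // Reinterpret pairs of bytes as 16-bit lanes.
        Vector<ushort> wide = Vector.AsVectorUInt16(value);
        // Multiplying by 2^n equals shifting each 16-bit lane left by n.
        Vector<ushort> shifted = wide * new Vector<ushort>((ushort)(1 << n));
        // Bits that crossed from one byte into its neighbor land in the low n bits
        // of that neighbor; the mask clears the low n bits of every byte.
        Vector<byte> mask = new Vector<byte>(unchecked((byte)(0xFF << n)));
        return Vector.AsVectorByte(shifted) & mask;
    }

    public static void Main() {
        Vector<byte> src = new Vector<byte>(0x81);
        Console.WriteLine(ShiftLeft(src, 1)); // every element is (byte)(0x81 << 1), i.e. 2
    }
}
```

The point of such a substitution is that every step maps onto operations that vector hardware does provide, even when a direct 8-bit shift is missing.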

For ShiftLeft, performance is generally better when shiftAmount is a constant than when it is a variable. This is true for both the BCL methods and this library's methods. This library's Core suffixed methods optimize performance by moving some operations out of the loop so that they are processed earlier. When the CPU provides instructions with constant parameters (the technical term is "immediate parameters"), those instructions are generally faster, so the library also provides methods with the ConstCore suffix, which select the fastest instruction for the platform. The measurements sometimes fluctuate because of "CPU Turbo Boost", other processes taking CPU resources, and similar effects. Inspecting the assembly instructions of the Release build at run time confirms that the code is already running on the best hardware instructions. The following figure shows an example.

Vectors.ShiftLeft_Core_use_inline.png

ShiftLeft - Arm - AWS Arm t4g.small

| Type | Method | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Byte | SumSLLScalar | 606.721 | 607.751 | 674.256 | 890.878 | 1238.814 |
| Byte | SumSLLNetBcl | | | | 19585.982 | 19831.927 |
| Byte | SumSLLNetBcl_Const | | | | 19564.840 | 19840.232 |
| Byte | SumSLLTraits | 5541.532 | 13075.259 | 13190.705 | 13209.927 | 19844.497 |
| Byte | SumSLLTraits_Core | 14048.511 | 16947.485 | 15828.571 | 19589.430 | 19841.525 |
| Byte | SumSLLConstTraits | 9734.870 | 15699.315 | 15853.772 | 19511.952 | 19811.385 |
| Byte | SumSLLConstTraits_Core | 13007.028 | 16817.247 | 15838.060 | 19422.222 | 19839.627 |
| Int16 | SumSLLScalar | 606.135 | 603.800 | 605.734 | 820.880 | 1031.035 |
| Int16 | SumSLLNetBcl | | | | 9943.220 | 9803.495 |
| Int16 | SumSLLNetBcl_Const | | | | 9937.639 | 9837.136 |
| Int16 | SumSLLTraits | 4215.369 | 6547.514 | 6558.299 | 9923.088 | 9839.256 |
| Int16 | SumSLLTraits_Core | 7918.688 | 8431.934 | 7892.235 | 9939.469 | 9839.496 |
| Int16 | SumSLLConstTraits | 6568.606 | 7829.860 | 7887.842 | 9925.988 | 9839.534 |
| Int16 | SumSLLConstTraits_Core | 8494.550 | 8416.796 | 7902.444 | 9914.384 | 9823.608 |
| Int32 | SumSLLScalar | 747.656 | 746.013 | 749.108 | 1406.122 | 1410.137 |
| Int32 | SumSLLNetBcl | | | | 4926.651 | 4826.909 |
| Int32 | SumSLLNetBcl_Const | | | | 4917.732 | 4840.232 |
| Int32 | SumSLLTraits | 3293.943 | 3269.129 | 3278.303 | 4925.488 | 4836.941 |
| Int32 | SumSLLTraits_Core | 4210.811 | 3930.619 | 3927.408 | 4923.867 | 4844.083 |
| Int32 | SumSLLConstTraits | 3275.986 | 3249.809 | 3923.176 | 4926.463 | 4846.238 |
| Int32 | SumSLLConstTraits_Core | 4205.245 | 4199.155 | 4156.634 | 4925.448 | 4844.679 |
| Int64 | SumSLLScalar | 739.137 | 729.158 | 741.673 | 1372.480 | 1296.655 |
| Int64 | SumSLLNetBcl | | | | 2477.025 | 2264.032 |
| Int64 | SumSLLNetBcl_Const | | | | 2473.102 | 2251.272 |
| Int64 | SumSLLTraits | 486.734 | 1638.835 | 1636.233 | 1985.596 | 2285.512 |
| Int64 | SumSLLTraits_Core | 489.554 | 2075.273 | 1967.902 | 2474.105 | 2289.521 |
| Int64 | SumSLLConstTraits | 467.393 | 1930.821 | 1968.798 | 2471.124 | 2308.745 |
| Int64 | SumSLLConstTraits_Core | 466.293 | 2074.656 | 1968.834 | 2476.602 | 2281.018 |

Description.

On the Arm platform, the BCL method (Vector.ShiftLeft) is hardware accelerated for integer types, because the AdvSimd instruction set provides dedicated left shift instructions for 8 to 64 bit integers. This library uses the same instructions on Arm, so the performance is close. Arm intrinsic functions have only been available since .NET 5.0, so hardware acceleration for the Int64 type is only available from .NET 5.0 onward.

ShiftRightArithmetic

ShiftRightArithmetic: Shifts (signed) each element of a vector right by the specified amount. It is a new vector method in .NET 7.0.

ShiftRightArithmetic - X86 - AMD Ryzen 7 7840H

| Type | Method | .NET Framework | .NET Core 2.1 | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Int16 | SumSRAScalar | 1085.176 | 1043.731 | 1227.822 | 1215.729 | 1209.230 | 1310.645 | 1397.378 |
| Int16 | SumSRANetBcl | | | | | | 31888.645 | 35102.079 |
| Int16 | SumSRANetBcl_Const | | | | | | 39751.018 | 36630.458 |
| Int16 | SumSRATraits | 1829.405 | 1861.938 | 25643.096 | 26584.675 | 26634.093 | 31578.602 | 37184.123 |
| Int16 | SumSRATraits_Core | 1837.663 | 1874.262 | 33248.481 | 36967.972 | 36890.508 | 37648.798 | 37673.670 |
| Int16 | SumSRAConstTraits | 1836.653 | 1880.351 | 28724.613 | 36985.528 | 39429.041 | 32925.588 | 37356.009 |
| Int16 | SumSRAConstTraits_Core | 1830.444 | 1879.354 | 33935.625 | 37498.165 | 38127.794 | 33120.549 | 35752.947 |
| Int32 | SumSRAScalar | 1362.876 | 1321.507 | 1508.831 | 1508.378 | 2226.648 | 2555.622 | 2327.611 |
| Int32 | SumSRANetBcl | | | | | | 16806.958 | 15967.982 |
| Int32 | SumSRANetBcl_Const | | | | | | 18365.861 | 16092.208 |
| Int32 | SumSRATraits | 883.925 | 895.137 | 12901.507 | 12508.762 | 11931.480 | 17609.103 | 16282.512 |
| Int32 | SumSRATraits_Core | 919.507 | 931.419 | 15956.786 | 15252.829 | 17412.025 | 18296.493 | 16230.128 |
| Int32 | SumSRAConstTraits | 911.750 | 942.523 | 13450.043 | 17314.816 | 14198.095 | 16799.445 | 16393.351 |
| Int32 | SumSRAConstTraits_Core | 917.228 | 938.789 | 15344.136 | 15470.629 | 17084.816 | 18274.411 | 16054.229 |
| Int32 | SumSRAFastTraits | 915.754 | 946.521 | 13266.168 | 15337.171 | 14562.129 | 17003.224 | 16124.004 |
| Int64 | SumSRAScalar | 1393.540 | 1331.963 | 1532.719 | 1544.306 | 1513.245 | 1801.859 | 2560.284 |
| Int64 | SumSRANetBcl | | | | | | 524.702 | 8652.579 |
| Int64 | SumSRANetBcl_Const | | | | | | 557.152 | 8870.207 |
| Int64 | SumSRATraits | 482.604 | 490.804 | 4949.328 | 4970.328 | 4932.277 | 4902.239 | 7541.726 |
| Int64 | SumSRATraits_Core | 509.432 | 521.769 | 5941.547 | 6050.322 | 6104.433 | 6043.337 | 8537.297 |
| Int64 | SumSRAConstTraits | 510.778 | 529.298 | 5526.893 | 5360.460 | 5834.075 | 6217.509 | 7562.071 |
| Int64 | SumSRAConstTraits_Core | 509.597 | 531.344 | 5899.752 | 5978.398 | 6049.756 | 6171.211 | 7720.979 |
| SByte | SumSRAScalar | 997.067 | 974.147 | 1278.049 | 1350.082 | 1227.788 | 1328.380 | 1387.993 |
| SByte | SumSRANetBcl | | | | | | 1135.177 | 1113.944 |
| SByte | SumSRANetBcl_Const | | | | | | 1165.780 | 1061.118 |
| SByte | SumSRATraits | 3635.592 | 3696.780 | 24686.302 | 22906.323 | 22437.129 | 24879.962 | 44225.353 |
| SByte | SumSRATraits_Core | 3652.670 | 3743.427 | 41915.608 | 45147.925 | 45375.300 | 46792.941 | 45642.076 |
| SByte | SumSRAConstTraits | 3651.109 | 3753.761 | 29819.076 | 42019.515 | 43095.169 | 44048.300 | 47091.982 |
| SByte | SumSRAConstTraits_Core | 3662.694 | 3753.270 | 39588.701 | 46397.665 | 47507.648 | 43046.477 | 46878.753 |

Description.

On X86 platforms, the BCL method (Vector.ShiftRightArithmetic) is hardware accelerated only for Int16/Int32, not for SByte/Int64. This is probably because the Avx2 instruction set only has arithmetic right shift instructions for 16 and 32 bit elements; a 64 bit arithmetic right shift instruction was only added in the Avx512 instruction set, which is consistent with the Int64 SumSRANetBcl row jumping from 524.702 on .NET 7.0 to 8652.579 on .NET 8.0. For the types that lack a native instruction, this library substitutes efficient algorithms implemented by combinations of other instructions. Hardware acceleration is available as of .NET Core 3.0.
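A scalar sketch can show how a 64-bit arithmetic right shift is obtainable from operations that Avx2 does offer for 64-bit elements (logical shift, xor, subtract). This is a well-known sign-extension trick shown for illustration; the source does not state which algorithm this library uses, and the class name `Sra64Sketch` is hypothetical.

```csharp
using System;

// Hypothetical illustration: arithmetic right shift built from a logical shift.
public static class Sra64Sketch {
    public static long ShiftRightArithmetic(long value, int shiftAmount) {
        int n = shiftAmount & 63;
        // Logical shift: the vacated high bits are filled with zeros.
        ulong logical = unchecked((ulong)value) >> n;
        // After the shift, the original sign bit sits at position 63 - n.
        ulong signBit = 1UL << (63 - n);
        // Xor then subtract sign-extends from that position upward.
        return unchecked((long)((logical ^ signBit) - signBit));
    }

    public static void Main() {
        Console.WriteLine(ShiftRightArithmetic(-4, 1)); // prints -2
        Console.WriteLine(ShiftRightArithmetic(12, 2)); // prints 3
    }
}
```

Each of the three steps has a direct per-element vector counterpart, which is what makes the substitution attractive for SIMD code.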

ShiftRightArithmetic - Arm - AWS Arm t4g.small

| Type | Method | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Int16 | SumSRAScalar | 604.429 | 602.027 | 606.297 | 818.740 | 830.302 |
| Int16 | SumSRANetBcl | | | | 9941.412 | 9837.372 |
| Int16 | SumSRANetBcl_Const | | | | 9931.397 | 9838.530 |
| Int16 | SumSRATraits | 1713.818 | 5611.316 | 4949.502 | 9932.269 | 9837.893 |
| Int16 | SumSRATraits_Core | 1928.197 | 7881.850 | 8435.043 | 9930.918 | 9707.757 |
| Int16 | SumSRAConstTraits | 1936.057 | 7776.346 | 8432.064 | 9926.348 | 9834.469 |
| Int16 | SumSRAConstTraits_Core | 1895.291 | 7825.036 | 8426.085 | 9923.414 | 9834.395 |
| Int32 | SumSRAScalar | 745.287 | 749.467 | 747.486 | 1181.651 | 1244.019 |
| Int32 | SumSRANetBcl | | | | 4929.438 | 4848.848 |
| Int32 | SumSRANetBcl_Const | | | | 4937.824 | 4854.964 |
| Int32 | SumSRATraits | 859.173 | 2815.113 | 2819.116 | 4937.562 | 4813.108 |
| Int32 | SumSRATraits_Core | 945.694 | 3917.314 | 3916.943 | 4933.939 | 4787.843 |
| Int32 | SumSRAConstTraits | 967.576 | 3904.750 | 4188.713 | 4901.680 | 4849.051 |
| Int32 | SumSRAConstTraits_Core | 947.955 | 3906.471 | 4192.951 | 4908.354 | 4853.184 |
| Int64 | SumSRAScalar | 738.902 | 734.754 | 741.343 | 1185.217 | 1243.954 |
| Int64 | SumSRANetBcl | | | | 2474.620 | 2433.159 |
| Int64 | SumSRANetBcl_Const | | | | 2478.519 | 2438.677 |
| Int64 | SumSRATraits | 467.838 | 1233.506 | 1233.401 | 1418.970 | 2424.896 |
| Int64 | SumSRATraits_Core | 468.470 | 1952.967 | 1971.453 | 2478.229 | 2424.819 |
| Int64 | SumSRAConstTraits | 467.182 | 1939.969 | 1970.321 | 2474.340 | 2413.790 |
| Int64 | SumSRAConstTraits_Core | 468.634 | 2095.352 | 2102.958 | 2474.473 | 2432.455 |
| SByte | SumSRAScalar | 608.671 | 609.771 | 652.251 | 889.935 | 830.400 |
| SByte | SumSRANetBcl | | | | 19779.972 | 19615.987 |
| SByte | SumSRANetBcl_Const | | | | 19803.799 | 19613.758 |
| SByte | SumSRATraits | 3482.537 | 11212.340 | 9894.245 | 11352.199 | 19512.654 |
| SByte | SumSRATraits_Core | 3857.464 | 16756.195 | 15733.712 | 19816.163 | 19419.454 |
| SByte | SumSRAConstTraits | 3905.027 | 15518.199 | 15732.344 | 19791.972 | 19617.529 |
| SByte | SumSRAConstTraits_Core | 3796.018 | 16708.142 | 16787.090 | 19791.891 | 19619.300 |

Description.

BCL methods (Vector.ShiftRightArithmetic) are hardware accelerated for integer types when running on Arm platforms. The AdvSimd instruction set provides special instructions for arithmetic right shifting of 8 to 64 bit integers. This library uses the same instructions when running on the Arm platform. The performance is similar. As of .NET 5.0, hardware acceleration is available.

Shuffle

Shuffle: Shuffle and clear. Creates a new vector by selecting values from an input vector using a set of indices. It is a new vector method in .NET 7.0. Since .NET 7.0, the Shuffle method has been provided in Vector128/Vector256, but the Shuffle method has not yet been provided in Vector.

Shuffle allows an index to exceed the valid range, and then sets the corresponding element to 0. This feature slows down performance a bit, so this library also provides the YShuffleKernel method (Only shuffle). If you want to make sure that the index is always within the valid range, it is faster to use YShuffleKernel.
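The difference between the two behaviors can be stated as a scalar reference model. The sketch below uses plain arrays and hypothetical names (`ShuffleReference`, `ShuffleKernel`); it only models the semantics described above and is not the library's vectorized implementation. What YShuffleKernel does with an out-of-range index is not specified here, so the model simply requires valid indices.

```csharp
using System;

// Hypothetical scalar model of the semantics, for illustration only.
public static class ShuffleReference {
    // Shuffle and clear: an index outside the valid range yields 0.
    public static short[] Shuffle(short[] source, short[] indices) {
        short[] result = new short[indices.Length];
        for (int i = 0; i < indices.Length; i++) {
            int index = indices[i];
            result[i] = (index >= 0 && index < source.Length) ? source[index] : (short)0;
        }
        return result;
    }

    // Only shuffle: the caller guarantees that every index is valid,
    // so the range check (and its cost) disappears.
    public static short[] ShuffleKernel(short[] source, short[] indices) {
        short[] result = new short[indices.Length];
        for (int i = 0; i < indices.Length; i++) {
            result[i] = source[indices[i]];
        }
        return result;
    }

    public static void Main() {
        short[] source = { 10, 20, 30, 40 };
        short[] indices = { 3, 0, 9, 1 }; // 9 is out of range
        Console.WriteLine(string.Join(", ", Shuffle(source, indices))); // prints 40, 10, 0, 20
    }
}
```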

Shuffle - X86 - AMD Ryzen 7 7840H

| Type | Method | .NET Framework | .NET Core 2.1 | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Int16 | SumScalar | 1236.944 | 1263.908 | 1214.484 | 1278.657 | 1195.188 | 1408.179 | 1235.365 |
| Int16 | Sum256_Bcl | | | | | | 1074.656 | 938.447 |
| Int16 | Sum512_Bcl | | | | | | | 918.911 |
| Int16 | SumTraits | 1221.046 | 1255.341 | 8067.493 | 10943.134 | 10421.696 | 14194.280 | 32579.746 |
| Int16 | SumTraits_Args0 | 1278.650 | 1211.361 | 22661.648 | 25363.988 | 24123.555 | 26722.243 | 34671.910 |
| Int16 | SumTraits_Args | 1255.109 | 1154.801 | 22911.649 | 26138.766 | 24804.170 | 26585.684 | 33172.777 |
| Int16 | SumKernelTraits | 1269.733 | 1192.079 | 8698.117 | 12377.326 | 11972.407 | 17610.477 | 35632.301 |
| Int16 | SumKernelTraits_Args0 | 1297.765 | 1199.697 | 23028.564 | 25852.122 | 25176.482 | 24261.582 | 36741.022 |
| Int16 | SumKernelTraits_Args | 1270.852 | 1142.885 | 23265.595 | 25960.405 | 21744.418 | 23156.078 | 37227.607 |
| Int32 | SumScalar | 850.057 | 829.782 | 816.013 | 859.672 | 817.223 | 853.140 | 837.720 |
| Int32 | Sum256_Bcl | | | | | | 755.314 | 770.558 |
| Int32 | Sum512_Bcl | | | | | | | 930.330 |
| Int32 | SumTraits | 821.394 | 844.388 | 10852.534 | 10832.760 | 10943.342 | 12695.692 | 15067.794 |
| Int32 | SumTraits_Args0 | 864.447 | 818.042 | 12704.591 | 15953.127 | 15574.554 | 14391.785 | 15559.766 |
| Int32 | SumTraits_Args | 810.166 | 762.183 | 12531.310 | 14746.991 | 14125.335 | 13524.193 | 15368.528 |
| Int32 | SumKernelTraits | 825.747 | 841.229 | 14515.308 | 14407.190 | 14545.131 | 16276.648 | 15999.993 |
| Int32 | SumKernelTraits_Args0 | 856.015 | 814.055 | 14754.810 | 14880.916 | 17262.390 | 14319.199 | 16261.174 |
| Int32 | SumKernelTraits_Args | 806.479 | 765.218 | 15073.768 | 14604.621 | 16999.007 | 16367.119 | 16422.220 |
| Int64 | SumScalar | 425.474 | 430.216 | 457.179 | 497.203 | 465.105 | 432.348 | 425.921 |
| Int64 | Sum256_Bcl | | | | | | 506.686 | 515.520 |
| Int64 | Sum512_Bcl | | | | | | | 688.892 |
| Int64 | SumTraits | 474.906 | 431.296 | 3789.327 | 4192.951 | 4280.568 | 4155.819 | 8171.028 |
| Int64 | SumTraits_Args0 | 423.703 | 461.664 | 6979.885 | 7855.241 | 8501.271 | 7846.303 | 8198.449 |
| Int64 | SumTraits_Args | 446.260 | 420.925 | 6704.874 | 8599.441 | 8317.550 | 7312.362 | 8378.340 |
| Int64 | SumKernelTraits | 473.823 | 426.081 | 4854.793 | 5862.440 | 5735.074 | 5938.699 | 8560.856 |
| Int64 | SumKernelTraits_Args0 | 424.508 | 458.248 | 7804.575 | 8108.408 | 9181.086 | 8364.106 | 8701.155 |
| Int64 | SumKernelTraits_Args | 446.097 | 428.538 | 8386.279 | 9239.331 | 9198.798 | 8344.952 | 8673.715 |
| SByte | SumScalar | 1496.783 | 1403.348 | 1448.660 | 1239.277 | 1468.827 | 1415.139 | 1213.582 |
| SByte | Sum256_Bcl | | | | | | 901.114 | 1022.223 |
| SByte | Sum512_Bcl | | | | | | | 989.131 |
| SByte | SumTraits | 1476.771 | 1494.144 | 17086.314 | 24231.464 | 24097.622 | 30243.434 | 60885.250 |
| SByte | SumTraits_Args0 | 1392.158 | 1331.083 | 45038.802 | 50540.409 | 49090.081 | 46979.783 | 60672.985 |
| SByte | SumTraits_Args | 1389.074 | 1295.641 | 46794.997 | 51069.265 | 50078.249 | 46518.750 | 65261.554 |
| SByte | SumKernelTraits | 1476.637 | 1242.198 | 27650.933 | 32894.218 | 32711.664 | 39630.939 | 72350.167 |
| SByte | SumKernelTraits_Args0 | 1523.543 | 1440.011 | 44451.891 | 49973.813 | 51540.236 | 48754.502 | 72615.251 |
| SByte | SumKernelTraits_Args | 1395.106 | 1274.943 | 41001.996 | 50067.099 | 49654.805 | 45904.504 | 71412.964 |

Description.

On X86 platforms, the BCL methods (Vector256.Shuffle and Vector512.Shuffle, the Sum256_Bcl and Sum512_Bcl rows) run without hardware acceleration for all number types. This library replaces them with efficient algorithms implemented by combinations of other instructions, and hardware acceleration is available as of .NET Core 3.0. The Args and Core suffixed methods optimize performance by moving some operations out of the loop so that they are processed earlier, which matters especially for Shuffle. If you can ensure that the indices are always in the valid range, you can use YShuffleKernel instead of Shuffle; it is faster. Besides returning multiple values through the "out" keyword, the Args suffixed methods can also return them as a ValueTuple, which simplifies the code. Be aware, however, that ValueTuple can sometimes slow down performance.
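A minimal sketch of the Args/Core split in a loop follows. It reuses only calls that appear in the sample program above (VectorTraitsGlobal.Init, Vectors.CreateRotate, Vectors<short>.SerialDesc, Vectors.Shuffle_Args with "out" parameters, Vectors.Shuffle_Core) and assumes the VectorTraits NuGet package is referenced. The class `ShuffleLoopSketch` and the method `ReverseEach` are hypothetical names.

```csharp
using System;
using System.Numerics;
using Zyl.VectorTraits;

// Hypothetical example: reverse the element order of every vector in an array.
public static class ShuffleLoopSketch {
    public static void ReverseEach(Vector<short>[] data) {
        Vector<short> indices = Vectors<short>.SerialDesc; // descending indices
        // Outside the loop: transform the indices once.
        Vectors.Shuffle_Args(indices, out var args0, out var args1);
        for (int i = 0; i < data.Length; i++) {
            // Inside the loop: only the core calculation on cached parameters.
            data[i] = Vectors.Shuffle_Core(data[i], args0, args1);
        }
    }

    public static void Main() {
        VectorTraitsGlobal.Init(); // initialization, as in the sample program
        Vector<short>[] data = {
            Vectors.CreateRotate<short>(0, 1, 2, 3, 4, 5, 6, 7),
            Vectors.CreateRotate<short>(7, 6, 5, 4, 3, 2, 1, 0)
        };
        ReverseEach(data);
        foreach (Vector<short> v in data) {
            Console.WriteLine(v);
        }
    }
}
```

Because the indices do not change between iterations, the parameter transformation is paid once instead of once per vector.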

Shuffle - Arm - AWS Arm t4g.small

| Type | Method | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Int16 | SumScalar | 427.276 | 421.887 | 421.454 | 526.589 | 516.294 |
| Int16 | Sum128_Bcl | | | | 482.907 | 468.383 |
| Int16 | SumTraits | 428.281 | 4922.876 | 5555.655 | 5864.193 | 9711.569 |
| Int16 | SumTraits_Args0 | 428.928 | 7902.420 | 8416.624 | 9925.441 | 9709.555 |
| Int16 | SumTraits_Args | 405.537 | 2809.483 | 2798.925 | 9880.804 | 9707.490 |
| Int16 | SumKernelTraits | 427.637 | 5650.913 | 6540.446 | 7957.175 | 9833.813 |
| Int16 | SumKernelTraits_Args0 | 427.578 | 7897.224 | 7891.894 | 9929.863 | 9819.774 |
| Int16 | SumKernelTraits_Args | 405.223 | 2811.195 | 2797.170 | 9861.330 | 9829.822 |
| Int32 | SumScalar | 286.900 | 281.167 | 281.838 | 317.876 | 309.427 |
| Int32 | Sum128_Bcl | | | | 304.320 | 301.222 |
| Int32 | SumTraits | 286.596 | 2311.209 | 2472.592 | 2917.343 | 4801.979 |
| Int32 | SumTraits_Args0 | 288.066 | 4185.430 | 3928.604 | 4934.590 | 4821.784 |
| Int32 | SumTraits_Args | 270.249 | 1396.323 | 1401.742 | 4886.669 | 4806.886 |
| Int32 | SumKernelTraits | 287.386 | 2677.394 | 3247.692 | 3953.573 | 4846.437 |
| Int32 | SumKernelTraits_Args0 | 286.724 | 3919.619 | 4182.617 | 4930.469 | 4852.808 |
| Int32 | SumKernelTraits_Args | 270.724 | 1399.968 | 1395.953 | 4899.359 | 4853.093 |
| Int64 | SumScalar | 448.592 | 440.758 | 444.884 | 552.061 | 534.531 |
| Int64 | Sum128_Bcl | | | | 708.356 | 692.663 |
| Int64 | SumTraits | 190.913 | 1005.614 | 1064.650 | 1255.025 | 2448.365 |
| Int64 | SumTraits_Args0 | 426.809 | 2090.887 | 2100.527 | 2479.821 | 2451.574 |
| Int64 | SumTraits_Args | 179.534 | 698.013 | 699.200 | 2457.898 | 2451.414 |
| Int64 | SumKernelTraits | 448.065 | 1237.258 | 1412.876 | 1753.457 | 2434.096 |
| Int64 | SumKernelTraits_Args0 | 449.857 | 2101.411 | 1967.152 | 2469.054 | 2443.626 |
| Int64 | SumKernelTraits_Args | 345.877 | 701.805 | 698.753 | 2456.761 | 2451.680 |
| SByte | SumScalar | 665.739 | 664.224 | 658.168 | 834.224 | 803.566 |
| SByte | Sum128_Bcl | | | | 647.757 | 610.244 |
| SByte | SumTraits | 680.590 | 13176.730 | 16739.161 | 19723.567 | 19531.685 |
| SByte | SumTraits_Args0 | 660.595 | 15704.393 | 15724.340 | 19723.852 | 19530.241 |
| SByte | SumTraits_Args | 637.568 | 5597.644 | 5602.803 | 19605.289 | 19527.338 |
| SByte | SumKernelTraits | 672.784 | 15604.597 | 16732.629 | 19692.571 | 19533.892 |
| SByte | SumKernelTraits_Args0 | 675.236 | 16718.959 | 15715.512 | 19729.144 | 19534.508 |
| SByte | SumKernelTraits_Args | 642.795 | 5573.999 | 5598.168 | 19588.655 | 19538.006 |

Description.

On the Arm platform, BCL's method (Vector.Shuffle) runs without hardware acceleration for all number types. For these types, this library substitutes efficient algorithms built from combinations of other instructions, so hardware acceleration is available from .NET 5.0 onward. Note that prior to .NET 7.0, SumTraits_Args was sometimes much slower than SumTraits_Args0, because ValueTuple incurs a large performance loss on Arm.

YNarrowSaturate

YNarrowSaturate: Narrows two Vector instances into one Vector with saturation.
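As a scalar reference for what "narrow with saturation" means, here is a minimal sketch for the Int32 to Int16 case. `NarrowSaturateSketch` and its members are hypothetical names, not this library's API. The two element-level helpers echo the names of the SumNarrow_If and SumNarrow_MinMax benchmark rows, but their bodies are assumptions about what such scalar baselines look like. Placing the first input in the lower half of the result follows the BCL Vector.Narrow convention and is likewise an assumption here.

```csharp
using System;

// Scalar model for illustration only. All names here are hypothetical.
public static class NarrowSaturateSketch
{
    // Branch-based clamp: values outside the Int16 range stick to the nearest bound.
    public static short NarrowIf(int value)
    {
        if (value < short.MinValue) return short.MinValue;
        if (value > short.MaxValue) return short.MaxValue;
        return (short)value;
    }

    // The same clamp expressed with Math.Min and Math.Max.
    public static short NarrowMinMax(int value)
    {
        return (short)Math.Min(Math.Max(value, (int)short.MinValue), (int)short.MaxValue);
    }

    // Two inputs of N wide elements become one output of 2N narrow elements.
    public static short[] NarrowSaturate(int[] lower, int[] upper)
    {
        var result = new short[lower.Length + upper.Length];
        for (int i = 0; i < lower.Length; i++) result[i] = NarrowIf(lower[i]);
        for (int i = 0; i < upper.Length; i++) result[lower.Length + i] = NarrowIf(upper[i]);
        return result;
    }
}
```

A plain cast such as `(short)100000` would wrap around instead of clamping, which is the difference between a non-saturating narrow and a saturating one.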

YNarrowSaturate - X86 - AMD Ryzen 7 7840H

| Type | Method | .NET Framework | .NET Core 2.1 | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Int16 | SumNarrow_If | 208.976 | 197.924 | 195.466 | 200.430 | 197.261 | 205.623 | 221.224 |
| Int16 | SumNarrow_MinMax | 200.034 | 201.184 | 197.505 | 208.715 | 199.736 | 222.635 | 208.102 |
| Int16 | SumNarrowVectorBase | 21160.119 | 19565.035 | 19063.346 | 19960.925 | 19532.398 | 19258.689 | 24197.090 |
| Int16 | SumNarrowVectorTraits | 20477.038 | 18251.731 | 44050.630 | 45196.128 | 43674.654 | 44677.389 | 47325.429 |
| Int32 | SumNarrow_If | 211.070 | 218.235 | 225.479 | 211.761 | 207.353 | 223.740 | 232.860 |
| Int32 | SumNarrow_MinMax | 221.396 | 206.735 | 214.815 | 214.341 | 211.238 | 210.944 | 223.415 |
| Int32 | SumNarrowVectorBase | 9753.258 | 9549.313 | 9743.042 | 9519.188 | 9577.993 | 10513.071 | 12059.829 |
| Int32 | SumNarrowVectorTraits | 9117.869 | 9253.891 | 20503.088 | 20225.447 | 19198.947 | 19012.815 | 19398.087 |
| Int64 | SumNarrow_If | 207.654 | 206.920 | 215.020 | 207.405 | 207.239 | 220.198 | 227.592 |
| Int64 | SumNarrow_MinMax | 205.724 | 201.036 | 203.815 | 200.292 | 213.422 | 213.819 | 231.741 |
| Int64 | SumNarrowVectorBase | 2951.264 | 2720.663 | 2835.882 | 2949.423 | 2915.473 | 4372.612 | 5917.536 |
| Int64 | SumNarrowVectorTraits | 2941.336 | 2696.543 | 4690.391 | 4875.851 | 4917.149 | 3808.744 | 9411.507 |
| UInt16 | SumNarrow_If | 1263.960 | 1205.876 | 1247.409 | 1184.537 | 1124.520 | 1175.733 | 1387.128 |
| UInt16 | SumNarrow_MinMax | 1363.298 | 1283.027 | 1336.103 | 1178.860 | 1344.978 | 761.908 | 1487.848 |
| UInt16 | SumNarrowVectorBase | 25617.831 | 25358.182 | 25019.795 | 25056.656 | 26527.170 | 25337.769 | 30941.796 |
| UInt16 | SumNarrowVectorTraits | 24795.433 | 24950.279 | 33163.801 | 41303.846 | 40678.067 | 29966.481 | 45560.104 |
| UInt32 | SumNarrow_If | 1446.297 | 1396.148 | 1364.953 | 1339.805 | 1382.470 | 1240.158 | 1507.078 |
| UInt32 | SumNarrow_MinMax | 1461.884 | 1346.542 | 1363.853 | 1376.390 | 1373.016 | 960.104 | 1383.498 |
| UInt32 | SumNarrowVectorBase | 12509.780 | 11160.711 | 11971.259 | 11511.978 | 11080.158 | 11897.237 | 15997.508 |
| UInt32 | SumNarrowVectorTraits | 12962.030 | 11581.014 | 14895.009 | 16343.372 | 17051.602 | 14727.107 | 19760.603 |
| UInt64 | SumNarrow_If | 1003.570 | 1326.642 | 913.881 | 912.071 | 878.848 | 1312.352 | 1874.180 |
| UInt64 | SumNarrow_MinMax | 1455.402 | 1404.391 | 1392.157 | 891.629 | 902.245 | 937.792 | 895.795 |
| UInt64 | SumNarrowVectorBase | 3340.377 | 3102.954 | 3033.044 | 3449.113 | 3649.422 | 5104.550 | 7693.314 |
| UInt64 | SumNarrowVectorTraits | 3306.018 | 3050.492 | 4497.385 | 5401.914 | 5969.621 | 4527.588 | 9530.757 |

Description.

For 16-bit and 32-bit integers, SumNarrowVectorTraits is much faster than SumNarrowVectorBase from .NET Core 3.1 onward, because X86 provides specialized instructions for this operation. For 64-bit integers (Int64/UInt64), X86 does not provide an equivalent instruction. However, the SumNarrowVectorTraits version uses a better algorithm built on intrinsic functions, so it still outperforms SumNarrowVectorBase in many cases.
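To make "specialized instructions" concrete, the sketch below uses `Sse2.PackSignedSaturate` from the BCL (`System.Runtime.Intrinsics.X86`, available since .NET Core 3.0), which maps to the PACKSSDW instruction for Int32 inputs. It is offered as one example of such an instruction. Whether this library uses exactly this call is an assumption, and `PackSaturateSketch` is a hypothetical name. A scalar fallback keeps the code runnable on hardware without SSE2.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Illustration only. The class and method names are hypothetical.
public static class PackSaturateSketch
{
    // Narrows eight Int32 values (four per input array) to eight Int16 values with saturation.
    public static short[] Narrow(int[] lower, int[] upper)
    {
        var result = new short[8];
        if (Sse2.IsSupported)
        {
            Vector128<int> lo = Vector128.Create(lower[0], lower[1], lower[2], lower[3]);
            Vector128<int> hi = Vector128.Create(upper[0], upper[1], upper[2], upper[3]);
            // A single instruction clamps and narrows both inputs:
            // elements 0..3 come from lo, elements 4..7 come from hi.
            Vector128<short> packed = Sse2.PackSignedSaturate(lo, hi);
            for (int i = 0; i < 8; i++) result[i] = packed.GetElement(i);
        }
        else
        {
            // Scalar fallback with the same result.
            for (int i = 0; i < 4; i++)
            {
                result[i] = (short)Math.Clamp(lower[i], (int)short.MinValue, (int)short.MaxValue);
                result[4 + i] = (short)Math.Clamp(upper[i], (int)short.MinValue, (int)short.MaxValue);
            }
        }
        return result;
    }
}
```

Production code would load from memory and store the packed vector directly rather than build and read vectors element by element. The element-wise form is used here only to keep the sketch short.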

YNarrowSaturate - Arm - AWS Arm t4g.small

| Type | Method | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: |
| Int16 | SumNarrow_If | 157.270 | 154.692 | 157.383 | 181.610 | 193.265 |
| Int16 | SumNarrow_MinMax | 160.909 | 165.733 | 108.425 | 184.240 | 189.973 |
| Int16 | SumNarrowVectorBase | 6100.275 | 6193.938 | 6308.118 | 7201.735 | 8261.974 |
| Int16 | SumNarrowVectorTraits | 6102.238 | 13460.358 | 13445.824 | 15514.261 | 13674.647 |
| Int32 | SumNarrow_If | 163.854 | 165.352 | 165.160 | 190.240 | 213.807 |
| Int32 | SumNarrow_MinMax | 154.976 | 162.019 | 161.884 | 195.349 | 194.881 |
| Int32 | SumNarrowVectorBase | 3047.923 | 3268.933 | 3253.378 | 3532.128 | 4034.752 |
| Int32 | SumNarrowVectorTraits | 3125.498 | 6121.553 | 6162.533 | 7914.641 | 6782.358 |
| Int64 | SumNarrow_If | 161.788 | 160.690 | 161.656 | 203.670 | 190.163 |
| Int64 | SumNarrow_MinMax | 160.836 | 157.655 | 164.693 | 194.496 | 201.793 |
| Int64 | SumNarrowVectorBase | 728.629 | 1157.104 | 1139.372 | 1231.877 | 1326.584 |
| Int64 | SumNarrowVectorTraits | 727.603 | 3114.720 | 3307.205 | 4088.677 | 3409.341 |
| UInt16 | SumNarrow_If | 527.761 | 515.076 | 531.818 | 608.056 | 832.441 |
| UInt16 | SumNarrow_MinMax | 573.087 | 525.410 | 576.628 | 608.744 | 893.594 |
| UInt16 | SumNarrowVectorBase | 8361.120 | 8439.577 | 7945.486 | 8853.731 | 11829.808 |
| UInt16 | SumNarrowVectorTraits | 8307.680 | 13106.613 | 14179.297 | 13964.213 | 16532.648 |
| UInt32 | SumNarrow_If | 537.550 | 534.718 | 539.467 | 620.874 | 989.646 |
| UInt32 | SumNarrow_MinMax | 539.997 | 537.029 | 545.333 | 620.923 | 827.472 |
| UInt32 | SumNarrowVectorBase | 4099.703 | 4021.154 | 3963.463 | 4356.804 | 5896.924 |
| UInt32 | SumNarrowVectorTraits | 4024.310 | 6340.994 | 6977.151 | 6619.009 | 7993.300 |
| UInt64 | SumNarrow_If | 619.788 | 621.120 | 620.256 | 827.649 | 995.113 |
| UInt64 | SumNarrow_MinMax | 619.494 | 620.151 | 620.119 | 818.259 | 994.695 |
| UInt64 | SumNarrowVectorBase | 1229.723 | 1821.232 | 1848.632 | 1805.499 | 2169.309 |
| UInt64 | SumNarrowVectorTraits | 1228.911 | 3489.303 | 3526.548 | 3480.212 | 4100.727 |

Description.

Arm intrinsic functions have been available since .NET 5.0. Therefore, starting from .NET 5.0, SumNarrowVectorTraits is much faster than SumNarrowVectorBase.

YGroup3Unzip

YGroup3Unzip: De-interleaves 3-element groups into 3 vectors. It converts 3-element groups from AoS (array of structures) to SoA (structure of arrays). For example, it can de-interleave packed RGB pixel data into planar R, G, B data.

YGroup3UnzipX2: De-interleaves 3-element groups into 3 vectors and processes twice as much data per call.
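The AoS to SoA conversion can be written as a scalar reference in a few lines. The sketch below is only a model of the data movement, with hypothetical names (`Group3UnzipSketch`, `Unzip`). The library's methods operate on vectors rather than arrays, and their signatures are not shown here.

```csharp
using System;

// Scalar model for illustration only. All names here are hypothetical.
public static class Group3UnzipSketch
{
    // Splits interleaved groups (x0 y0 z0 x1 y1 z1 ...) into three planar arrays.
    // Trailing elements that do not form a complete group are ignored.
    public static void Unzip(byte[] interleaved, out byte[] x, out byte[] y, out byte[] z)
    {
        int groupCount = interleaved.Length / 3;
        x = new byte[groupCount];
        y = new byte[groupCount];
        z = new byte[groupCount];
        for (int i = 0; i < groupCount; i++)
        {
            x[i] = interleaved[3 * i];
            y[i] = interleaved[3 * i + 1];
            z[i] = interleaved[3 * i + 2];
        }
    }
}
```

For packed RGB pixels, `x`, `y` and `z` would hold the R, G and B planes respectively.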

YGroup3Unzip - X86 - AMD Ryzen 7 7840H

| Type | Method | .NET Framework | .NET Core 2.1 | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Byte | SumBase_Basic | 255.172 | 496.713 | 501.725 | 499.601 | 566.925 | 505.052 | 670.702 |
| Byte | SumBase | 1140.616 | 1053.352 | 1089.103 | 1138.235 | 1111.114 | 1478.675 | 1463.708 |
| Byte | SumTraits | 1121.904 | 1086.799 | 7468.216 | 11280.246 | 11541.671 | 12438.171 | 21865.365 |
| Byte | SumX2Base | 2169.025 | 2088.353 | 2171.143 | 2111.332 | 2179.099 | 2812.575 | 2973.122 |
| Byte | SumX2Traits | 2229.977 | 2160.516 | 10419.951 | 10989.673 | 10985.330 | 11472.251 | 22393.695 |
| Int16 | SumBase_Basic | 213.465 | 389.617 | 439.760 | 352.833 | 453.870 | 404.842 | 533.252 |
| Int16 | SumBase | 738.972 | 723.809 | 686.669 | 739.079 | 728.061 | 1015.709 | 1008.942 |
| Int16 | SumTraits | 759.109 | 691.273 | 3767.055 | 5383.595 | 5638.094 | 6270.971 | 10452.168 |
| Int16 | SumX2Base | 1327.217 | 1262.400 | 1260.547 | 1312.866 | 1288.727 | 1723.543 | 1761.102 |
| Int16 | SumX2Traits | 1320.545 | 1227.530 | 6120.175 | 6190.444 | 6208.993 | 5798.718 | 10909.299 |
| Int32 | SumBase_Basic | 186.128 | 276.261 | 295.992 | 219.993 | 323.416 | 280.863 | 391.511 |
| Int32 | SumBase | 184.001 | 273.403 | 306.846 | 224.431 | 320.332 | 551.148 | 555.068 |
| Int32 | SumTraits | 189.108 | 277.059 | 6262.687 | 6454.641 | 6392.289 | 6488.127 | 6951.683 |
| Int32 | SumX2Base | 155.218 | 257.316 | 284.894 | 247.659 | 318.492 | 1072.598 | 1093.091 |
| Int32 | SumX2Traits | 160.252 | 253.319 | 5049.720 | 6341.390 | 6285.681 | 6215.097 | 7422.183 |
| Int64 | SumBase_Basic | 136.976 | 170.057 | 187.362 | 131.130 | 193.633 | 175.953 | 240.232 |
| Int64 | SumBase | 135.652 | 170.323 | 187.933 | 125.485 | 192.634 | 168.300 | 238.422 |
| Int64 | SumTraits | 135.704 | 167.900 | 4095.410 | 3868.199 | 4015.411 | 4061.920 | 4385.505 |
| Int64 | SumX2Base | 108.319 | 151.252 | 178.444 | 137.145 | 182.990 | 155.501 | 243.663 |
| Int64 | SumX2Traits | 109.441 | 151.243 | 2684.613 | 3883.237 | 3978.648 | 3893.358 | 4785.675 |

Description.

YGroup3Unzip - Arm - AWS Arm t4g.small

| Type | Method | .NET Core 3.1 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| :--- | :--- | ---: | ---: | ---: | ---: |
| Byte | SumBase_Basic | 263.957 | 265.524 | 327.819 | 381.159 |
| Byte | SumBase | 380.369 | 406.259 | 430.545 | 443.813 |
| Byte | SumTraits | 378.710 | 4381.575 | 4113.304 | 6510.157 |
| Byte | SumX2Base | 702.851 | 728.691 | 740.690 | 767.491 |
| Byte | SumX2Traits | 700.539 | 4412.785 | 4273.763 | 5294.112 |
| Int16 | SumBase_Basic | 188.885 | 189.823 | 222.856 | 279.398 |
| Int16 | SumBase | 213.360 | 228.410 | 235.157 | 242.377 |
| Int16 | SumTraits | 213.356 | 1926.559 | 2134.925 | 3037.124 |
| Int16 | SumX2Base | 419.434 | 448.638 | 466.043 | 475.565 |
| Int16 | SumX2Traits | 419.442 | 2413.794 | 2650.031 | 2638.161 |
| Int32 | SumBase_Basic | 138.088 | 143.089 | 154.241 | 196.818 |
| Int32 | SumBase | 141.071 | 143.390 | 186.784 | 198.177 |
| Int32 | SumTraits | 144.696 | 1033.899 | 1069.974 | 1494.205 |
| Int32 | SumX2Base | 121.726 | 138.986 | 275.479 | 310.983 |
| Int32 | SumX2Traits | 119.468 | 1598.185 | 1547.795 | 1618.239 |
| Int64 | SumBase_Basic | 109.766 | 100.523 | 84.039 | 189.270 |
| Int64 | SumBase | 109.531 | 102.084 | 81.358 | 185.056 |
| Int64 | SumTraits | 107.335 | 1153.333 | 1176.315 | 1191.362 |
| Int64 | SumX2Base | 97.857 | 96.111 | 79.729 | 203.008 |
| Int64 | SumX2Traits | 98.162 | 1216.716 | 1155.302 | 1374.619 |

More results

See: BenchmarkResults

Documentation

More samples

ChangeLog

Full list: ChangeLog