Home

Awesome

Build & test for dotnet 3.1, 5.0, 6.0

SIMDArray FSharp

SIMD and other Performance enhanced Array operations for F#

Example Usage

//Faster map

let array = [| 1 .. 1000 |]
let squaredArray = array |> Array.SIMD.map (fun x -> x*x) (fun x -> x*x)  

// Map and many other functions need one lambda to map the Vector<T>, 
// and one to handle any leftover elements if array is not divisible by 
// Vector<T>.Count. In the case of simple arithmetic operations they can
// often be the same as shown here. If you arrange your arrays such that 
// they will never have leftovers, or don't care how leftovers are treated 
// just pass a nop like so:

open SIMDArrayUtils

let array = [|1;2;3;4;5;6;7;8|]
let squaredArray = array |> Array.SIMD.map (fun x -> x*x) nop


// Some functions can be used just like the existing array functions but run faster
// such as create and sum:

let newArray = Array.SIMD.create 1000 5 //create a new array of length 1000 filled with 5
let sum = Array.SIMD.sum newArray

// The Performance module has functions that are faster and/or use less memory
// via other means than SIMD. Usually by relaxing ordering constraints or adding
// constraints to predicates:

let distinctElements = Array.Performance.distinctUnordered someArray
let filteredElements = Array.Performance.filterLessThan 5 someArray
let filteredElements = Array.Performance.filterSimplePredicate (fun x -> x*x < 100) someArray
Array.Performance.mapInPlace (fun x-> x*x) someArray

// The SIMDParallel module has parallelized versions of some of the SIMD operations:

let sum = Array.SIMDParallel.sum array
let map = Array.SIMDParallel.map (fun x -> x*x) array

// Two extensions are added to System.Threading.Tasks.Parallel, to enable Parallel.For loops
// with a stride length efficiently. They also have much less overhead. You can use them to roll your own 
// parallel SIMD functions, or any parallel operation that needs a stride length > 1

// Using:
// ForStride (fromInclusive : int) (toExclusive :int) (stride : int) (f : int -> unit)
// You can map each Vector in an array and store it in result
Parallel.ForStride 0 array.Length (Vector< ^T>.Count) 
        (fun i -> (vf (Vector< ^T>(array,i ))).CopyTo(result,i))

// Using:
// ForStrideAggreagate (fromInclusive : int) (toExclusive :int) (stride : int) (acc: ^T) (f : int -> ^T -> ^T) combiner
// You can sum or otherwise aggregate the elements of an array a Vector at a time, starting from an initial acc
let result = Parallel.ForStrideAggreagate 0 array.Length (Vector< ^T>.Count) Vector< ^T>(0)
					(fun i acc -> acc + (Vector< ^T>(array,i)))  
					(fun x acc -> x + acc)  //combines the results from each task into a final Vector that is returned


Notes

Only 64 bit builds are supported. Mono should work with 5.0+, but I have not yet tested it. Performance improvements will vary depending on your CPU architecture, width of Vector type, and the operations you apply. For small arrays the core libs may be faster due SIMD overhead. When measuring performance be sure to use Release builds with optimizations turned on.

Floating point addition is not associative, so results with SIMD operations will not be identical, though often they will be more accurate, such as in the case of sum, or average.

Upd: .NET 7.0 Basic Tests

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1526 (21H2)
AMD Ryzen 7 3800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.3.22179.4
  [Host]     : .NET 7.0.0 (7.0.22.17504), X64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.17504), X64 RyuJIT


|     Method |  Length |            Mean |         Error |        StdDev |   Gen 0 |   Gen 1 |   Gen 2 |   Allocated |
|----------- |-------- |----------------:|--------------:|--------------:|--------:|--------:|--------:|------------:|
|     ForSum |     100 |        98.78 ns |      0.533 ns |      0.473 ns |  0.0507 |       - |       - |       424 B |
| ForSumSIMD |     100 |        56.32 ns |      0.378 ns |      0.353 ns |  0.0507 |  0.0001 |       - |       424 B |
|        Dot |     100 |       157.32 ns |      0.672 ns |      0.629 ns |       - |       - |       - |           - |
|    DotSIMD |     100 |        19.59 ns |      0.121 ns |      0.107 ns |       - |       - |       - |           - |
|        Max |     100 |        55.57 ns |      0.146 ns |      0.129 ns |       - |       - |       - |           - |
|    MaxSIMD |     100 |        13.53 ns |      0.070 ns |      0.065 ns |       - |       - |       - |           - |
|      MaxBy |     100 |        60.37 ns |      0.163 ns |      0.153 ns |       - |       - |       - |           - |
|  MaxBySIMD |     100 |        20.06 ns |      0.063 ns |      0.056 ns |       - |       - |       - |           - |
|     ForSum |    1000 |       862.28 ns |      5.412 ns |      5.063 ns |  0.4807 |  0.0067 |       - |     4,024 B |
| ForSumSIMD |    1000 |       441.22 ns |      2.874 ns |      2.548 ns |  0.4809 |  0.0072 |       - |     4,024 B |
|        Dot |    1000 |     1,484.23 ns |      5.292 ns |      4.691 ns |       - |       - |       - |           - |
|    DotSIMD |    1000 |       162.66 ns |      1.095 ns |      0.971 ns |       - |       - |       - |           - |
|        Max |    1000 |       526.03 ns |      2.177 ns |      1.818 ns |       - |       - |       - |           - |
|    MaxSIMD |    1000 |        44.45 ns |      0.101 ns |      0.094 ns |       - |       - |       - |           - |
|      MaxBy |    1000 |       506.51 ns |      0.619 ns |      0.548 ns |       - |       - |       - |           - |
|  MaxBySIMD |    1000 |       139.48 ns |      0.126 ns |      0.106 ns |       - |       - |       - |           - |
|     ForSum | 1000000 | 1,642,884.15 ns | 32,686.799 ns | 52,783.087 ns | 93.7500 | 93.7500 | 93.7500 | 4,000,061 B |
| ForSumSIMD | 1000000 |   484,576.66 ns |  9,685.048 ns |  9,512.012 ns | 95.7031 | 95.7031 | 95.7031 | 4,000,055 B |
|        Dot | 1000000 | 1,468,907.49 ns |  6,495.111 ns |  5,070.956 ns |       - |       - |       - |           - |
|    DotSIMD | 1000000 |   160,549.66 ns |    277.915 ns |    232.071 ns |       - |       - |       - |           - |
|        Max | 1000000 |   485,969.64 ns |    565.230 ns |    501.061 ns |       - |       - |       - |           - |
|    MaxSIMD | 1000000 |    48,748.71 ns |     72.373 ns |     67.698 ns |       - |       - |       - |           - |
|      MaxBy | 1000000 |   490,922.69 ns |    563.828 ns |    470.822 ns |       - |       - |       - |           - |
|  MaxBySIMD | 1000000 |   135,049.15 ns |     57.546 ns |     51.013 ns |       - |       - |       - |           - |


Performance Comparison vs Standard Array Functions


Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0

Type=SIMDBenchmark  Mode=Throughput  Platform=X64  
Jit=RyuJit  GarbageCollection=Concurrent Workstation  

Sum 1 million 32bit ints, ParallelSIMD vs SIMD vs Core Lib <a name="parallel"></a>

MethodLengthMedianStdDevScaledGen 0Gen 1Gen 2Bytes Allocated/Op
sum1000000979.9477 us15.4036 us1.00--1.0014,967.09
SIMDsum1000000163.5663 us2.7872 us0.17--0.171,960.97
SIMDParallelsum100000082.3069 us6.4637 us0.083.74-0.041,674.94

With 32bit Floats Vs Core Lib. Map function (fun x -> x*x)<a name="core32"></a>

MethodLengthMedianStdDevGen 0Gen 1Gen 2Bytes Allocated/Op
SIMDContains1032.3354 ns0.0933 ns0.04--22.80
Contains1013.0234 ns0.6457 ns---0.00
SIMDMap1037.3615 ns0.0693 ns0.09--53.95
Map1015.6651 ns0.2422 ns0.04--25.80
SIMDSum1019.3450 ns0.1866 ns---0.00
Sum106.2273 ns0.2982 ns---0.00
SIMDMax1020.8972 ns0.7380 ns---0.00
Max107.9275 ns0.9701 ns---0.00
SIMDContains10061.6295 ns5.0472 ns0.04--24.92
Contains100140.9920 ns2.4739 ns---0.01
SIMDMap10075.8733 ns0.5875 ns0.33--192.40
Map100120.3029 ns0.4232 ns0.29--172.39
SIMDSum10032.0058 ns1.1225 ns---0.00
Sum10077.6100 ns2.4902 ns---0.00
SIMDMax10035.9042 ns2.0587 ns---0.00
Max10092.1754 ns9.6637 ns---0.00
SIMDContains1000417.0760 ns10.6672 ns---0.04
Contains10001,333.0239 ns11.8959 ns---0.07
SIMDMap1000439.8549 ns7.5810 ns3.05--2,176.91
Map10001,073.2894 ns16.1444 ns2.93--2,086.24
SIMDSum1000162.8308 ns5.8158 ns---0.01
Sum1000947.1124 ns14.4370 ns---0.07
SIMDMax1000167.0257 ns5.3584 ns---0.01
Max1000698.2252 ns21.2244 ns---0.03
SIMDContains1000000427,765.2001 ns3,541.8344 ns--0.237,507.17
Contains10000001,315,198.8375 ns19,634.6409 ns--0.3614,912.24
SIMDMap10000001,747,002.9295 ns18,219.0807 ns--519.181,198,305.57
Map10000001,962,408.1761 ns23,319.8186 ns--746.001,702,687.72
SIMDSum1000000160,972.7015 ns3,359.1696 ns--0.051,960.97
Sum1000000955,224.0942 ns12,365.7613 ns--0.3814,853.87
SIMDMax1000000158,835.3746 ns3,773.1697 ns--0.061,961.66
Max1000000633,761.7634 ns6,149.8767 ns--0.247,495.76

With 64bit Floats vs Core Lib. Map function (fun x -> x*x+x)<a name="core64"></a>

MethodLengthMedianStdDevGen 0Gen 1Gen 2Bytes Allocated/Op
SIMDContains1000842.2604 ns36.6615 ns---0.13
Contains10001,338.2032 ns21.7835 ns---0.13
SIMDSum1000302.8986 ns12.0417 ns---0.03
Sum1000953.9314 ns7.3770 ns---0.13
SIMDMax1000302.3690 ns11.8064 ns---0.03
Max1000713.9227 ns23.1721 ns---0.07
SIMDMap1000905.3396 ns21.1726 ns2.79--4,447.68
Map10001,369.6668 ns17.1072 ns2.88--4,591.74
SIMDContains10000086,987.0417 ns212.5612 ns---204.08
Contains100000129,737.5287 ns2,300.6178 ns---398.91
SIMDSum10000030,836.7527 ns52.3596 ns---103.84
Sum10000097,310.6367 ns444.7469 ns---203.88
SIMDMax10000030,755.6959 ns189.2460 ns---103.84
Max10000065,190.8396 ns810.8605 ns---203.88
SIMDMap100000250,263.5686 ns23,822.3931 ns--351.03384,182.34
Map100000239,693.9435 ns20,283.1824 ns--350.24383,399.62
SIMDContains1000000952,116.9191 ns22,885.3666 ns--0.1729,960.47
Contains10000001,469,353.0761 ns44,872.5327 ns--0.1528,150.78
SIMDSum1000000493,523.5731 ns6,629.8292 ns--0.1215,020.79
Sum10000001,059,862.2497 ns21,029.2608 ns--0.1729,921.97
SIMDMax1000000486,232.3883 ns3,963.6126 ns--0.1115,080.61
Max1000000771,554.3061 ns7,083.0659 ns--0.1215,008.20
SIMDMap10000003,625,255.0307 ns40,939.9131 ns--439.003,763,516.65
Map10000003,490,854.2334 ns51,255.2300 ns--413.003,589,365.95

With 32bit Floats vs MathNET.Numerics managed. Map function (fun x -> x*x+x) <a name="mathnet"></a>

MethodLengthMedianStdDevGen 0Gen 1Gen 2Bytes Allocated/Op
SIMDMapInPlace10046.5269 ns4.9229 ns0.08--22.54
MathNETMapInPlace100354.0866 ns7.5375 ns0.36--99.59
SIMDSum10032.0283 ns2.9529 ns---0.00
MathNETSum10088.7532 ns1.9561 ns---0.00
SIMDMapInPlace1000165.7885 ns9.0778 ns---0.01
MathNETMapInPlace10003,057.9378 ns56.8845 ns0.30--94.64
SIMDSum1000163.1672 ns6.7001 ns---0.01
MathNETSum1000962.2084 ns13.9839 ns---0.12
SIMDMapInPlace10000021,078.0491 ns627.8978 ns---56.61
MathNETMapInPlace100000104,831.7547 ns8,823.8473 ns5.26--2,267.50
SIMDSum10000015,134.0240 ns708.8177 ns---46.02
MathNETSum10000097,051.7780 ns875.9276 ns---217.82
SIMDMapInPlace1000000220,760.2212 ns7,167.1597 ns--0.467,402.18
MathNETMapInPlace1000000824,388.9221 ns47,134.8321 ns--1.8733,210.67
SIMDSum1000000159,887.6959 ns5,030.3486 ns--0.183,433.93
MathNETSum1000000967,761.7422 ns17,557.1206 ns--2.0029,450.93

With 32bit Floats vs MathNET.Numerics MKL Native. Adding two arrays <a name="mathnetnative"></a>

MethodLengthMedianStdDevGen 0Gen 1Gen 2Bytes Allocated/Op
SIMDMap210092.1515 ns3.0304 ns2.70--212.76
MathNETAdd100156.7522 ns7.3969 ns2.92--230.42
SIMDMap21000493.5448 ns8.1340 ns21.40--2,048.32
MathNETAdd1000444.0753 ns5.9375 ns20.12--1,553.56
SIMDMap2100000161,024.7782 ns24,704.0627 ns--2,348.29197,602.33
MathNETAdd100000155,985.3149 ns1,478.0502 ns--1,755.36155,754.29
SIMDMap210000002,024,351.2170 ns242,101.0167 ns--3,317.762,025,584.78
MathNETAdd10000001,551,270.9391 ns216,545.6630 ns--2,466.001,693,319.93