
This repo evaluates different matrix multiplication implementations given two large square matrices (2000-by-2000 in the following example):

|Implementation |Long description|
|:--------------|:---------------|
|Naive          |Most obvious implementation|
|Transposed     |Transposing the second matrix for cache efficiency|
|sdot w/o hints |Replacing the inner loop with BLAS `sdot()`|
|sdot with hints|`sdot()` with a partially unrolled loop|
|SSE sdot       |Vectorized `sdot()` with explicit SSE instructions|
|SSE+tiling sdot|SSE `sdot()` with loop tiling|
|OpenBLAS sdot  |`sdot()` provided by OpenBLAS|
|OpenBLAS sgemm |`sgemm()` provided by OpenBLAS|

To compile the evaluation program:

```sh
make CBLAS=/path/to/cblas/prefix
```

or omit the `CBLAS` setting if you do not have CBLAS installed. After compilation, use

```sh
./matmul -h
```

to see the available options. Here are the results on my machines:

|Implementation |-a |Linux, -n2000|Linux, -n4000|Linux/icc, -n4000|Mac, -n2000|
|:--------------|:--|:------------|:------------|:----------------|:----------|
|Naive          |0  |7.53 sec     |188.85 sec   |173.76 sec       |77.45 sec  |
|Transposed     |1  |6.66 sec     |55.48 sec    |21.04 sec        |9.73 sec   |
|sdot w/o hints |4  |6.66 sec     |55.04 sec    |21.35 sec        |9.70 sec   |
|sdot with hints|3  |2.41 sec     |29.47 sec    |21.69 sec        |2.92 sec   |
|SSE sdot       |2  |1.36 sec     |21.79 sec    |22.18 sec        |2.92 sec   |
|SSE+tiling sdot|7  |1.11 sec     |10.84 sec    |10.97 sec        |1.90 sec   |
|OpenBLAS sdot  |5  |2.69 sec     |28.87 sec    |                 |5.61 sec   |
|OpenBLAS sgemm |6  |0.63 sec     |4.91 sec     |                 |0.86 sec   |
|uBLAS          |   |7.43 sec     |165.74 sec   |                 |           |
|Eigen          |   |0.61 sec     |4.76 sec     |5.01 sec         |0.85 sec   |

The machine configurations are as follows:

|Machine|CPU                         |OS         |Compiler                  |
|:------|:---------------------------|:----------|:-------------------------|
|Linux  |2.6 GHz Xeon E5-2697        |CentOS 6   |gcc-4.4.7/icc-15.0.3      |
|Mac    |1.7 GHz Intel Core i5-2557M |OS X 10.9.5|clang-600.0.57/LLVM-3.5svn|

On both machines, OpenBLAS-0.2.18 is compiled with the following options (no AVX or multithreading):

```
TARGET=CORE2
BINARY=64
USE_THREAD=0
NO_SHARED=1
ONLY_CBLAS=1
NO_LAPACK=1
NO_LAPACKE=1
```
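
These options can be passed directly on OpenBLAS's own `make` command line instead of being edited into its build configuration (assuming a stock OpenBLAS-0.2.18 source tree; the install prefix here is only an example):

```sh
# Build a static, single-threaded, CBLAS-only OpenBLAS without LAPACK
make TARGET=CORE2 BINARY=64 USE_THREAD=0 NO_SHARED=1 ONLY_CBLAS=1 NO_LAPACK=1 NO_LAPACKE=1
make PREFIX=/path/to/cblas/prefix install
```

The same prefix can then be given to this repo's `make CBLAS=` setting.
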