Octavian

To make sure that CPUSummary 1.11 and newer use Hwloc, you can run

julia> using CPUSummary

julia> CPUSummary.use_hwloc(true);

which should give Octavian accurate hardware information. Using Hwloc is the default, so this call is typically unnecessary.

Octavian.jl is a multi-threaded BLAS-like library that provides pure Julia matrix multiplication on the CPU, built on top of LoopVectorization.jl.

Please see the Octavian documentation.
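
As a quick orientation, here is a minimal usage sketch. It assumes Octavian.jl is installed in the active environment and uses the `matmul` and `matmul!` functions; the matrix sizes are arbitrary and chosen only for illustration.

```julia
using Octavian          # assumes Octavian.jl is installed in the active environment
using LinearAlgebra     # standard library, used here only for the reference product

A = rand(200, 300)
B = rand(300, 100)

# Allocating product, analogous to A * B
C = Octavian.matmul(A, B)

# In-place product written into a preallocated output matrix
C2 = similar(C)
Octavian.matmul!(C2, A, B)

# Both results should agree with the default BLAS-backed product
@assert C ≈ A * B
@assert C2 ≈ C
```

The in-place form avoids allocating the output on every call, which matters when the same sizes are multiplied repeatedly.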

Octavian has dropped support for 32-bit Julia; see PR #157. If you are interested in restoring it, please file a PR that fixes the failing tests.

Benchmarks

You can run benchmarks using BLASBenchmarksCPU.jl:

julia> @time using BLASBenchmarksCPU
  7.278954 seconds (17.59 M allocations: 1.107 GiB, 6.22% gc time)

julia> rb = runbench(sizes = logspace(10, 1_000, 200)); plot(rb, displayplot = false);
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 2:25:04
  Size:               (1000, 1000, 1000)
  BLIS:               (MedianGFLOPS = 1051.0, MaxGFLOPS = 1476.0)
  Gaius:              (MedianGFLOPS = 765.8, MaxGFLOPS = 941.7)
  MKL:                (MedianGFLOPS = 1348.0, MaxGFLOPS = 1589.0)
  Octavian:           (MedianGFLOPS = 1816.0, MaxGFLOPS = 1895.0)
  OpenBLAS:           (MedianGFLOPS = 1254.0, MaxGFLOPS = 1385.0)
  Tullio:             (MedianGFLOPS = 1102.0, MaxGFLOPS = 1196.0)
  LoopVectorization:  (MedianGFLOPS = 1552.0, MaxGFLOPS = 1721.0)

julia> versioninfo()
Julia Version 1.7.0-DEV.1124
Commit d18cf93bac* (2021-05-19 16:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

This run produced the following plot (image: octavian10980xebench).
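
To help read the GFLOPS figures above, the following sketch shows how such a number can be derived from a timing. It assumes the conventional count of 2·M·K·N floating-point operations for an (M×K)·(K×N) product; whether BLASBenchmarksCPU uses exactly this formula is an assumption. The helper names `gemm_flops`, `gflops`, and `time_mul` are hypothetical, and the sketch times the standard library's `mul!` rather than any of the benchmarked packages.

```julia
using LinearAlgebra  # standard library; provides mul!

# Floating-point operations in an (M×K) * (K×N) product:
# one multiply and one add per inner-product term (assumed convention).
gemm_flops(M, K, N) = 2 * M * K * N

# Convert an elapsed time in seconds into GFLOPS.
gflops(M, K, N, seconds) = gemm_flops(M, K, N) / seconds / 1e9

# Time C = A * B several times and return the fastest run in seconds.
function time_mul(M, K, N; trials = 5)
    A = rand(M, K)
    B = rand(K, N)
    C = zeros(M, N)
    mul!(C, A, B)              # warm-up call so compilation is not timed
    best = Inf
    for _ in 1:trials
        t = @elapsed mul!(C, A, B)
        best = min(best, t)
    end
    return best
end

M = K = N = 200
t = time_mul(M, K, N)
println("approx. GFLOPS at size ", (M, K, N), ": ", gflops(M, K, N, t))
```

A single small timing like this is noisy; the benchmark package above sweeps many sizes and reports median and maximum values, which gives a far more reliable picture.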

Related Packages

Julia Package	CPU	GPU
Gaius.jl	Yes	No
GemmKernels.jl	No	Yes
Octavian.jl	Yes	No
Tullio.jl	Yes	Yes

In general:

Octavian has the fastest CPU performance.
GemmKernels has the fastest GPU performance.
Tullio is the most flexible.