Octavian
To make sure that CPUSummary 1.11 and newer use Hwloc, you may want to run
julia> using CPUSummary
julia> CPUSummary.use_hwloc(true);
which should let CPUSummary report accurate hardware information. This is the default, so the call should typically be unnecessary.
Octavian.jl is a multi-threaded BLAS-like library that provides pure Julia matrix multiplication on the CPU, built on top of LoopVectorization.jl.
Please see the Octavian documentation.
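
As a minimal usage sketch (not taken from the Octavian documentation), the `matmul` and `matmul!` functions can be checked against the default product from `LinearAlgebra`. The matrix sizes and variable names below are arbitrary choices for illustration.

```julia
using Octavian
using LinearAlgebra

# Arbitrary illustrative sizes; any compatible dimensions work.
M, K, N = 200, 300, 150
A = rand(M, K)
B = rand(K, N)

# Allocating multiply: returns a new M×N matrix.
C = Octavian.matmul(A, B)

# In-place multiply: writes A*B into a preallocated output.
C2 = Matrix{Float64}(undef, M, N)
Octavian.matmul!(C2, A, B)

# Both results should agree with the default BLAS-backed product
# up to floating-point rounding.
@assert C ≈ A * B
@assert C2 ≈ C
```

The in-place form avoids allocating the output on every call, which tends to matter when the same product is computed repeatedly.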
Octavian dropped 32-bit Julia support. See PR #157. If you're interested in restoring it, please file a PR to fix the failing tests.
Benchmarks
You can run benchmarks using BLASBenchmarksCPU.jl:
julia> @time using BLASBenchmarksCPU
7.278954 seconds (17.59 M allocations: 1.107 GiB, 6.22% gc time)
julia> rb = runbench(sizes = logspace(10, 1_000, 200)); plot(rb, displayplot = false);
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 2:25:04
Size: (1000, 1000, 1000)
BLIS: (MedianGFLOPS = 1051.0, MaxGFLOPS = 1476.0)
Gaius: (MedianGFLOPS = 765.8, MaxGFLOPS = 941.7)
MKL: (MedianGFLOPS = 1348.0, MaxGFLOPS = 1589.0)
Octavian: (MedianGFLOPS = 1816.0, MaxGFLOPS = 1895.0)
OpenBLAS: (MedianGFLOPS = 1254.0, MaxGFLOPS = 1385.0)
Tullio: (MedianGFLOPS = 1102.0, MaxGFLOPS = 1196.0)
LoopVectorization: (MedianGFLOPS = 1552.0, MaxGFLOPS = 1721.0)
julia> versioninfo()
Julia Version 1.7.0-DEV.1124
Commit d18cf93bac* (2021-05-19 16:11 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
Resulted in the following: [Figure: benchmark plot for the run above]
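
Separately, as a rough guide to reading the GFLOPS figures in the output above: assuming the conventional count of 2·M·K·N floating-point operations for an (M×K)·(K×N) multiply, a 1000×1000×1000 product is 2×10⁹ operations, so 1816 GFLOPS corresponds to roughly 1.1 ms per multiply. The sketch below applies the same arithmetic to the default `LinearAlgebra.mul!`. The helper names `gemm_flops` and `gflops` are hypothetical, and this is not how BLASBenchmarksCPU.jl itself takes its measurements.

```julia
using LinearAlgebra

# Floating-point operations in an (M×K) * (K×N) multiply:
# one multiply and one add per inner-loop iteration.
gemm_flops(M, K, N) = 2 * M * K * N

# Convert an operation count and elapsed seconds to GFLOPS.
gflops(flops, seconds) = flops / seconds / 1e9

M = K = N = 1000
A = rand(M, K)
B = rand(K, N)
C = zeros(M, N)

mul!(C, A, B)  # warm-up call so compilation time is not measured

# Take the best of a few runs to reduce timing noise.
t = minimum([@elapsed(mul!(C, A, B)) for _ in 1:5])

println("approx. GFLOPS: ", round(gflops(gemm_flops(M, K, N), t), digits = 1))
```

The printed number depends on the machine, the BLAS backend, and the thread count, so it is not directly comparable to the figures listed above.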
Related Packages
| Julia Package | CPU | GPU |
|---|---|---|
| Gaius.jl | Yes | No |
| GemmKernels.jl | No | Yes |
| Octavian.jl | Yes | No |
| Tullio.jl | Yes | Yes |
In general:
- Octavian has the fastest CPU performance.
- GemmKernels has the fastest GPU performance.
- Tullio is the most flexible.
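
To give a sense of what flexibility means here, the following sketch uses Tullio's `@tullio` index notation for an ordinary matrix product and for a batched one. The array sizes and names are arbitrary, and the example only illustrates the notation; it makes no claim about relative speed.

```julia
using Tullio

A = rand(4, 5)
B = rand(5, 6)

# Plain matrix multiplication: the repeated index k is summed over.
@tullio C[i, j] := A[i, k] * B[k, j]
@assert C ≈ A * B

# A batched variant, which a two-argument matmul does not express directly:
# one independent product per slice along the third dimension.
X = rand(4, 5, 3)
Y = rand(5, 6, 3)
@tullio D[i, j, b] := X[i, k, b] * Y[k, j, b]
@assert D[:, :, 2] ≈ X[:, :, 2] * Y[:, :, 2]
```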