1. Neo - A Matrix library
This library is meant to provide basic linear algebra operations for Nim applications. The ambition would be to become a stable basis on which to develop a scientific ecosystem for Nim, much like Numpy does for Python.
The library has been tested on Ubuntu Linux 16.04 and 20.04 64-bit using ATLAS, OpenBLAS or Intel MKL. It was also tested on OSX Yosemite to Monterey. The GPU support has been tested using NVIDIA CUDA 8.0 up to 10.x.
The library is currently aligned with the latest Nim devel.
API documentation is here
A lot of examples are available in the tests.
Table of contents
- Introduction
- Working on the CPU
- Working on the GPU
- Static typing for dimensions
- Design
- Linking external libraries
- TODO
- Contributing
1.1. Introduction
The library revolves around operations on vectors and matrices of floating point numbers. Computations can run either on the CPU or on the GPU, with identical APIs.
The library defines the types Matrix[A] and Vector[A], where A is sometimes restricted to be float32 or float64 (usually in order to use BLAS and LAPACK routines). Vector[A] is just a small wrapper around seq[A], which makes it possible to perform linear algebra operations on standard Nim sequences without copying.
Similar types exist on the GPU side, and there are facilities to move them back and forth from the CPU.
Neo makes use of many standard libraries such as BLAS, LAPACK and CUDA. See this section to learn how to link the correct implementation for your platform.
1.2. Working on the CPU
1.2.1. Dense linear algebra
1.2.1.1. Initialization
Here we show a few ways to create matrices and vectors. All matrix constructors accept a parameter that defines whether to store the matrix in row-major order (data are laid out in memory row by row) or column-major order (data are laid out in memory column by column). The default is always column-major.
Whenever possible, the precision (32 or 64 bits) is deduced from the parameters. When this is not possible, there is an optional parameter float32 that can be passed to specify the precision (the default is 64 bit).
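To make the two storage orders concrete, here is a minimal sketch in plain Nim that does not use Neo at all. The Order type and the offset proc are hypothetical names introduced for illustration; they only show how the position of element (i, j) in a flat buffer depends on the chosen order.

```nim
type Order = enum
  colMajor, rowMajor

# Position of element (i, j) inside the flat buffer of a rows x cols matrix.
proc offset(i, j, rows, cols: int, order: Order): int =
  case order
  of colMajor: result = j * rows + i   # columns are stored one after another
  of rowMajor: result = i * cols + j   # rows are stored one after another

# The 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] in both layouts.
let
  colData = @[1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
  rowData = @[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# The element on row 0, column 1 is 2.0 in both layouts.
echo colData[offset(0, 1, 2, 3, colMajor)]
echo rowData[offset(0, 1, 2, 3, rowMajor)]
```

The same logical element lives at different positions in the two buffers, which is why accessors such as m[i, j] have to take the order into account.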
Static matrices and vectors can be created like this:
import neo
let
  v1 = makeVector(5, proc(i: int): float64 = (i * i).float64)
  v2 = randomVector(7, max = 3.0) # max is optional, default 1
  v3 = constantVector(5, 3.5)
  v4 = zeros(8)
  v5 = ones(9)
  v6 = vector(1.0, 2.0, 3.0, 4.0, 5.0)
  v7 = vector([1.2, 3.4, 5.6])
  m1 = makeMatrix(6, 3, proc(i, j: int): float64 = (i + j).float64)
  m2 = randomMatrix(2, 8, max = 1.6) # max is optional, default 1
  m3 = constantMatrix(3, 5, 1.8, order = rowMajor) # order is optional, default colMajor
  m4 = ones(3, 6)
  m5 = zeros(5, 2)
  m6 = eye(7)
  m7 = matrix(@[
    @[1.2, 3.5, 4.3],
    @[1.1, 4.2, 1.7]
  ])
All constructors that take as input an existing array or seq perform a copy of the data for memory safety.
1.2.1.2. Working with 32-bit
Some constructors (such as zeros) allow a type specifier if one wants to create a 32-bit vector or matrix. The following examples all return 32-bit vectors and matrices:
import neo
let
  v1 = makeVector(5, proc(i: int): float32 = (i * i).float32)
  v2 = randomVector(7, max = 3'f32) # max is no longer optional, to distinguish 32/64 bit
  v3 = constantVector(5, 3.5'f32)
  v4 = zeros(8, float32)
  v5 = ones(9, float32)
  v6 = vector(1'f32, 2'f32, 3'f32, 4'f32, 5'f32)
  v7 = vector([1.2'f32, 3.4'f32, 5.6'f32])
  m1 = makeMatrix(6, 3, proc(i, j: int): float32 = (i + j).float32)
  m2 = randomMatrix(2, 8, max = 1.6'f32)
  m3 = constantMatrix(3, 5, 1.8'f32, order = rowMajor) # order is optional, default colMajor
  m4 = ones(3, 6, float32)
  m5 = zeros(5, 2, float32)
  m6 = eye(7, float32)
  m7 = matrix(@[
    @[1.2'f32, 3.5'f32, 4.3'f32],
    @[1.1'f32, 4.2'f32, 1.7'f32]
  ])
One can convert precision with to32 or to64:
let
  v64 = randomVector(10)
  v32 = v64.to32()
  m32 = randomMatrix(3, 8, max = 1'f32)
  m64 = m32.to64()
Once vectors and matrices are created, everything is inferred, so there are no differences in working with 32-bit or 64-bit. All examples that follow are for 64-bit, but they would work as well for 32-bit.
1.2.1.3. Accessors
Vectors can be accessed as expected:
var v = randomVector(6)
v[4] = 1.2
echo v[3]
The same holds for matrices, where m[i, j] denotes the item on row i and column j, regardless of the matrix order:
var m = randomMatrix(3, 7)
m[1, 3] = 0.8
echo m[2, 2]
One can also map vectors and matrices via a proc:
let
  v1 = v.map(proc(x: float64): float64 = 2 - 3 * x)
  m1 = m.map(proc(x: float64): float64 = 1 / x)
1.2.1.4. Slicing
The row and column procs will return vectors that share memory with their parent matrix:
let
  m = randomMatrix(10, 10)
  r2 = m.row(2)
  c5 = m.column(5)
Similarly, one can slice a matrix with the familiar notation:
let
  m = randomMatrix(10, 10)
  m1 = m[2 .. 4, 3 .. 8]
  m2 = m[All, 1 .. 5]
where All is a placeholder that denotes that no slicing occurs on that dimension.
In general it is convenient to have slicing, rows and columns that do not copy data but share the underlying data sequence. This can have two possible drawbacks:
- the result may need to be modified while the original matrix stays unchanged, or vice versa;
- a small matrix or vector may hold a reference to a large data sequence, preventing it from being garbage collected.
In these cases, it is enough to call the .clone() proc to obtain a copy of the matrix or vector with its own storage.
1.2.1.5. Iterators
One can iterate over vector or matrix elements, as well as over rows and columns:
let
  v = randomVector(6)
  m = randomMatrix(3, 5)
for x in v: echo x
for i, x in v: echo i, x
for x in m: echo x
for t, x in m:
  let (i, j) = t
  echo i, j, x
for row in m.rows:
  echo row[0]
for column in m.columns:
  echo column[1]
One important point about performance. When iterating over rows or columns, the same ref is reused throughout. This makes the loop allocation-free, and hence orders of magnitude faster. One should take care not to hold on to these references, because they will be mutated. This means that, for instance, the following is correct:
let m = randomMatrix(1000, 1000)
var columnSum = zeros(1000)
for c in m.columns:
  columnSum += c
but the following will give wrong results (all elements of cols will be identical at the end):
let m = randomMatrix(1000, 1000)
var cols = newSeq[Vector[float64]]()
for c in m.columns:
  cols.add(c)
If one needs a fresh reference for each element of the iteration, the rowsSlow and columnsSlow iterators are available, hence the following modification is ok:
let m = randomMatrix(1000, 1000)
var cols = newSeq[Vector[float64]]()
for c in m.columnsSlow:
  cols.add(c)
1.2.1.6. Equality
There are two kinds of equality. The usual == operator will compare the contents of vectors and matrices exactly:
let
  u = vector(1.0, 2.0, 3.0, 4.0)
  v = vector(1.0, 2.0, 3.0, 4.0)
  w = vector(1.0, 3.0, 3.0, 4.0)
u == v # true
u == w # false
Usually, though, one wants to take into account the errors introduced by floating point operations. To do this, use the =~ operator, or its negation !=~:
let
  u = vector(1.0, 2.0, 3.0, 4.0)
  v = vector(1.0, 2.000000001, 2.99999999, 4.0)
u == v # false
u =~ v # true
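As an illustration of what a tolerance-based comparison can look like, here is a minimal sketch in plain Nim on ordinary sequences. The approxEq proc and its default tolerance eps are assumptions made for this example; the tolerance and the exact criterion used by the =~ operator are not stated here and may differ.

```nim
# Hypothetical helper: true when two sequences agree within a relative tolerance.
proc approxEq(a, b: seq[float64], eps = 1e-6): bool =
  if a.len != b.len:
    return false
  for i in 0 ..< a.len:
    # scale the tolerance by the magnitude of the operands (at least 1.0)
    let scale = max(1.0, max(abs(a[i]), abs(b[i])))
    if abs(a[i] - b[i]) > eps * scale:
      return false
  result = true

let
  u = @[1.0, 2.0, 3.0, 4.0]
  v = @[1.0, 2.000000001, 2.99999999, 4.0]

echo u == v          # exact comparison of the contents
echo approxEq(u, v)  # comparison up to the tolerance
```

Scaling the tolerance by the size of the operands keeps the check meaningful for both small and large values.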
1.2.1.7. Pretty-print
Both vectors and matrices have a pretty-print operation, so one can do
let m = randomMatrix(3, 7)
echo m
and get something like
[ [ 0.5024584865674662 0.0798945419892334 0.7512423051567048 0.9119041361916302 0.5868388894943912 0.3600554448403415 0.4419034543022882 ]
[ 0.8225964245706265 0.01608615513584155 0.1442007939324697 0.7623388321096165 0.8419745686508193 0.08792951865247645 0.2902529012579151 ]
[ 0.8488187232786935 0.422866666087792 0.1057975175658363 0.07968277822379832 0.7526946339452074 0.7698915909784674 0.02831893268471575 ] ]
1.2.1.8. Reshape operations
The following operations do not change the underlying memory layout of matrices and vectors. This means they run in very little time even on big matrices, but you have to pay attention when mutating matrices and vectors produced in this way, since the underlying data is shared.
let
  m1 = randomMatrix(6, 9)
  m2 = randomMatrix(9, 6)
  v1 = randomVector(9)
echo m1.t # transpose, done in constant time without copying
echo m1 + m2.t
let m3 = m1.reshape(9, 6)
let m4 = v1.asMatrix(3, 3)
let v2 = m2.asVector
In case you need to allocate a copy of the original data, say in order to transpose a matrix and then mutate the transpose without altering the original matrix, a clone operation is available:
let m5 = m1.clone
Notice that clone() will be called internally anyway when using one of the reshape operations with a matrix that is not contiguous (that is, a matrix obtained by slicing).
There is also a hard transpose operation which, unlike t(), will not try to share storage but will always create a new matrix and copy the data into it (this way, it also preserves the row-major or column-major order). The hard transpose is denoted T(), so that
m.t == m.T
always holds, although the internal representations differ.
1.2.1.9. BLAS Operations
A few linear algebra operations are available, wrapping BLAS libraries:
var v1 = randomVector(7)
let
  v2 = randomVector(7)
  m1 = randomMatrix(6, 9)
  m2 = randomMatrix(9, 7)
echo 3.5 * v1
v1 *= 2.3
echo v1 + v2
echo v1 - v2
echo v1 * v2 # dot product
echo v1 |*| v2 # Hadamard (component-wise) product
echo l_1(v1) # l_1 norm
echo l_2(v1) # l_2 norm
echo m2 * v1 # matrix-vector product
echo m1 * m2 # matrix-matrix product
echo m1 |*| m1 # Hadamard (component-wise) product, operands must have the same dimensions
echo max(m1)
echo min(v2)
1.2.1.10. Universal functions
Universal functions are real-valued functions that are extended to vectors and matrices by working element-wise. There are many common functions that are implemented as universal functions:
sqrt
cbrt
log10
log2
log
exp
arccos
arcsin
arctan
cos
cosh
sin
sinh
tan
tanh
erf
erfc
lgamma
tgamma
trunc
floor
ceil
degToRad
radToDeg
This means that, for instance, the following check passes:
let
  v1 = vector(1.0, 2.3, 4.5, 3.2, 5.4)
  v2 = log(v1)
  v3 = v1.map(log)
assert v2 == v3
Universal functions work both on 32 and 64 bit precision, on vectors and matrices.
If you have a function f of type proc(x: float64): float64 you can use
makeUniversal(f)
to turn f into a (public) universal function. If you do not want to export f, there is the equivalent template makeUniversalLocal.
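To illustrate the idea of lifting a scalar function to work element-wise, here is a minimal sketch in plain Nim on ordinary sequences. The makeElementwise template is a hypothetical name introduced for this example and is not how makeUniversal is necessarily implemented; it only shows that a template can define a new overload of f that maps over a container.

```nim
import std/math

# Hypothetical template: given a scalar function f, define an overload of f
# that applies it to every element of a seq[float64].
template makeElementwise(f: untyped) =
  proc f(xs: seq[float64]): seq[float64] =
    result = newSeq[float64](xs.len)
    for i in 0 ..< xs.len:
      result[i] = f(xs[i])   # resolves to the scalar overload

makeElementwise(sqrt)

echo sqrt(@[4.0, 9.0, 16.0])
```

Because the new proc is an overload, the scalar version of the function remains available under the same name.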
1.2.1.11. Rewrite rules
A few rewrite rules make it possible to optimize a chain of linear algebra operations into a single BLAS call. For instance, if you write
echo v1 + 5.3 * v2
this is not implemented as a scalar multiplication followed by a sum, but it is turned into a single function call.
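To see why fusing the two steps matters, here is a minimal sketch in plain Nim on ordinary sequences. The procs scaleThenAdd and fused are hypothetical names for this example. The fused version computes y + a * x in a single pass, which is the kind of computation a BLAS axpy-style routine performs; which BLAS call the rewrite rules actually target is an assumption here. Both procs assume that x and y have the same length.

```nim
# Two passes: a temporary for a * x, then a second loop for the sum.
proc scaleThenAdd(y, x: seq[float64], a: float64): seq[float64] =
  var tmp = newSeq[float64](x.len)
  for i in 0 ..< x.len:
    tmp[i] = a * x[i]
  result = newSeq[float64](y.len)
  for i in 0 ..< y.len:
    result[i] = y[i] + tmp[i]

# One pass, no temporary vector.
proc fused(y, x: seq[float64], a: float64): seq[float64] =
  result = y
  for i in 0 ..< x.len:
    result[i] += a * x[i]

let
  v1 = @[1.0, 2.0, 3.0]
  v2 = @[10.0, 20.0, 30.0]

echo scaleThenAdd(v1, v2, 2.0)
echo fused(v1, v2, 2.0)
```

The fused form avoids allocating and traversing an intermediate vector, which is what makes a single call cheaper on large inputs.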
1.2.1.12. Stacking vectors and matrices
Vectors can be stacked both horizontally (which gives a new vector)
let
  v1 = vector([1.0, 2.0])
  v2 = vector([5.0, 7.0, 9.0])
  v3 = vector([9.9, 8.8, 7.7, 6.6])
echo hstack(v1, v2, v3) # vector([1.0, 2.0, 5.0, 7.0, 9.0, 9.9, 8.8, 7.7, 6.6])
or vertically (which gives a matrix having the vectors as rows)
let
  v1 = vector([1.0, 2.0, 3.0])
  v2 = vector([5.0, 7.0, 9.0])
  v3 = vector([9.9, 8.8, 7.7])
echo vstack(v1, v2, v3)
# matrix(@[
# @[1.0, 2.0, 3.0],
# @[5.0, 7.0, 9.0],
# @[9.9, 8.8, 7.7]
# ])
Also, concat is an alias for hstack.
Matrices can be stacked similarly, for instance
let
  m1 = matrix(@[
    @[1.0, 2.0],
    @[3.0, 4.0]
  ])
  m2 = matrix(@[
    @[5.0, 7.0, 9.0],
    @[6.0, 2.0, 1.0]
  ])
  m3 = matrix(@[
    @[2.0, 2.0],
    @[1.0, 3.0]
  ])
echo hstack(m1, m2, m3)
# m = matrix(@[
# @[1.0, 2.0, 5.0, 7.0, 9.0, 2.0, 2.0],
# @[3.0, 4.0, 6.0, 2.0, 1.0, 1.0, 3.0]
# ])
TODO: stack matrices
1.2.1.13. Solving linear systems
Some linear algebraic functions are included, currently for solving systems of linear equations of the form Ax = b, for square matrices A. Functions to invert square invertible matrices are also provided. These throw floating-point errors in the case of non-invertible matrices.
These functions require a LAPACK implementation.
let
  a = randomMatrix(5, 5)
  b = randomVector(5)
echo solve(a, b)
echo a \ b # equivalent
echo a.inv()
1.2.1.14. Computing eigenvalues and eigenvectors
These functions require a LAPACK implementation.
To be documented.
1.2.2. Sparse linear algebra
To be documented.
1.3. Working on the GPU
1.3.1. Dense linear algebra
If you have a matrix or vector, you can move it to the GPU and back like this:
import neo, neo/cuda
let
  v = randomVector(12, max=1'f32)
  vOnTheGpu = v.gpu()
  vBackOnTheCpu = vOnTheGpu.cpu()
Vectors and matrices on the GPU support linear-algebraic operations via cuBLAS, exactly like their CPU counterparts. A few operations, such as reading a single element, are not supported, as it does not make much sense to copy a single value back and forth from the GPU. Usually it is advisable to move vectors and matrices to the GPU, perform as many computations as possible there, and finally move the result back to the CPU.
The following are all valid operations, assuming v and w are vectors on the GPU, m and n are matrices on the GPU and the dimensions are compatible:
v * 3'f32
v + w
v -= w
m * v
m - n
m * n
For more information, look at the tests in tests/cudadense.
1.3.2. Sparse linear algebra
To be documented.
1.4. Static typing for dimensions
Under neo/statics there exist types that encode vectors and matrices whose dimensions are known at compile time. They are defined as distinct types over their dynamic counterparts:
type
  StaticVector*[N: static[int]; A] = distinct Vector[A]
  StaticMatrix*[M, N: static[int]; A] = distinct Matrix[A]
In this way, these types are fully interoperable with the dynamic ones. One can freely convert between the two representations:
import neo, neo/statics
let
  u = randomVector(5) # static, of known dimension 5
  v = u.asDynamic
  w = v.asStatic(5)
assert(u == w)
All operations implemented by neo are also available for static vectors and matrices. The differences are that:
- operations on static vectors and matrices will not compile if the dimensions do not match
- operations on static vectors and matrices will return other static vectors and matrices, thereby automatically tracking dimensions.
An example of an operation that will not compile is
import neo, neo/statics
let
  m = statics.randomMatrix(5, 7) # static, of known dimension 5x7
  n = statics.randomMatrix(4, 6) # static, of known dimension 4x6
  p = statics.randomMatrix(7, 3) # static, of known dimension 7x3
discard m * n # this will not compile
let x = m * p # this will infer dimension 5x3
By converting back and forth between static and dynamic vectors and matrices - which can be done at no cost - one can incorporate data whose dimension is only known at runtime, while at the same time having guaranteed dimension compatibility whenever enough information is known at compile time.
For now, statics are only available on the CPU. It would be a nice contribution to extend this to GPU types.
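To show how dimensions can be tracked by the type system, here is a minimal sketch in plain Nim that does not use Neo. The type SMatrix and the proc constantS are hypothetical names introduced for this example; the point is only that static[int] parameters let the compiler reject a product whose inner dimensions disagree.

```nim
type
  # Hypothetical type: a column-major matrix whose dimensions are part of the type.
  SMatrix[M, N: static[int]] = object
    data: seq[float64]

proc constantS(M, N: static[int], x: float64): SMatrix[M, N] =
  result.data = newSeq[float64](M * N)
  for i in 0 ..< M * N:
    result.data[i] = x

# The shared dimension K must agree at compile time.
proc `*`[M, K, N: static[int]](a: SMatrix[M, K], b: SMatrix[K, N]): SMatrix[M, N] =
  result.data = newSeq[float64](M * N)
  for i in 0 ..< M:
    for j in 0 ..< N:
      var s = 0.0
      for k in 0 ..< K:
        s += a.data[k * M + i] * b.data[j * K + k]
      result.data[j * M + i] = s

let
  m = constantS(5, 7, 1.0)
  n = constantS(4, 6, 1.0)
  p = constantS(7, 3, 1.0)

let x = m * p           # inferred as SMatrix[5, 3]
echo x.data.len
echo compiles(m * n)    # inner dimensions 7 and 4 do not match
```

The compiles check is used here only to demonstrate the rejection without breaking the build; writing m * n directly would stop compilation.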
1.5. Design
1.5.1. On the CPU
On the CPU, dense vectors and matrices are stored using this structure:
type
  MatrixShape* = enum
    Diagonal, UpperTriangular, LowerTriangular, UpperHessenberg, LowerHessenberg, Symmetric
  Vector*[A] = ref object
    data*: seq[A]
    fp*: ptr A # float pointer
    len*, step*: int
  Matrix*[A] = ref object
    order*: OrderType
    M*, N*, ld*: int # ld = leading dimension
    fp*: ptr A # float pointer
    data*: seq[A]
    shape*: set[MatrixShape]
Each stores some information on dimensions (len for vectors, M and N for matrices) and a pointer fp to the beginning of the actual data.
The order of a matrix can be colMajor or rowMajor: the first one means that the matrix is stored column by column, the second row by row.
To make it easier to share data without copying, but still keep the data garbage collected, the actual data is usually allocated in a seq, here called data, which can be shared between matrices and their slices, or row and column vectors. The pointer fp usually points somewhere inside this sequence, although this is not required.
All operations are expressed in terms of fp, so data is not really required. When the last reference to data goes away, though, the sequence is garbage collected and there will be no more pointers inside it. If a small vector or matrix holds the last reference to a big chunk of data, it may be more convenient to copy it to a fresh location and free the data itself: this can be done by using the clone() operation.
Vectors are not required to be contiguous, and they have a step parameter that determines how far apart their elements are. This is useful when taking a row of a column major matrix or a column of a row major one.
Matrices can also be non-contiguous. When taking a minor of a column major matrix, one gets a submatrix whose elements are contiguous within a column, but whose columns are not contiguously placed. Rather, the distance (in elements) between the start of two successive columns is the same as in the parent matrix, and is called the leading dimension of the matrix (here stored as ld). A similar remark holds for row major matrices, where ld is the number of elements between the beginnings of two successive rows.
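The role of step and ld can be illustrated with a minimal sketch in plain Nim on a flat sequence. The strided proc is a hypothetical helper for this example; Neo works with pointers rather than copying, so this only shows which elements of the buffer a row, a column or a submatrix refers to.

```nim
# Flat column-major storage of the 3 x 4 matrix
#   [ 0  3  6  9 ]
#   [ 1  4  7 10 ]
#   [ 2  5  8 11 ]
let
  data = @[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
  ld = 3   # leading dimension: distance between the starts of two columns

# Hypothetical helper: collect count elements starting at start, step apart.
proc strided(data: seq[float64], start, count, step: int): seq[float64] =
  result = newSeq[float64](count)
  for k in 0 ..< count:
    result[k] = data[start + k * step]

echo strided(data, 1 * ld, 3, 1)   # column 1: contiguous, step 1
echo strided(data, 1, 4, ld)       # row 1: elements are ld apart

# Submatrix of rows 1..2 and columns 1..2: it starts at offset 1 * ld + 1
# and keeps the leading dimension of its parent.
for j in 0 ..< 2:
  echo strided(data, 1 * ld + 1 + j * ld, 2, 1)
```

Each column of the submatrix is contiguous, but consecutive columns still start ld elements apart, exactly as in the parent matrix.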
This design makes it possible to have matrices or vectors that are not managed by the garbage collector. In this case, it is enough to set fp manually, and leave data nil. This makes it possible to support
- matrices and vectors with data on the stack, which can be constructed using the stackVector and stackMatrix constructors (and which are only valid as long as the relevant data lives on the stack), and
- matrices and vectors allocated manually on the shared heap, which can be constructed using the sharedVector and sharedMatrix constructors, and destructed with dealloc.
1.5.2. Why fields are public
Notice that all members of the types are public, but in general it is not safe to change them if you don't know what you are doing. These fields are not generally meant to be changed, and a previous version of the library had them private. In general, though, a user may need access to some of these fields for performance reasons, so they are exposed.
An example of this case is the rows (or columns) iterator. Neo keeps vector and matrix types on the heap (they are ref types). This prevents accidental copying, but has the downside that creating a new one requires allocation. When iterating over rows in a loop, one wants to avoid triggering many costly allocations, since the underlying data is always the same, and the only thing that changes is the position of the vectors inside this data array. For this reason, the iterator is implemented as follows:
iterator rows*[A](m: Matrix[A]): auto {. inline .} =
  let
    mp = cast[CPointer[A]](m.fp)
    step = if m.order == rowMajor: m.ld else: 1
  var v = m.row(0)
  yield v
  for i in 1 ..< m.M:
    v.fp = addr(mp[i * step])
    yield v
There is a single vector which is reused at each step, and the iterator always yields this vector, with fp changed. A user who wants, say, to implement a similar iteration over some minors of a matrix may need to perform a similar trick, and preventing changes to fp would rule out this optimization.
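The consequence of reusing a single ref can be reproduced with a minimal sketch in plain Nim. The View type and the two iterators are hypothetical stand-ins introduced for this example: View only records where a row starts, in place of a real vector with its fp pointer.

```nim
type
  # Hypothetical stand-in for a row vector: it only records where the row starts.
  View = ref object
    start: int

# Reuses a single View: no allocation inside the loop.
iterator rowsReused(nRows, ld: int): View =
  let v = View(start: 0)
  for i in 0 ..< nRows:
    v.start = i * ld
    yield v

# Allocates a new View at every step, so each one can be kept safely.
iterator rowsFresh(nRows, ld: int): View =
  for i in 0 ..< nRows:
    yield View(start: i * ld)

var kept, fresh: seq[View]
for v in rowsReused(3, 4): kept.add(v)
for v in rowsFresh(3, 4): fresh.add(v)

# Every entry of kept is the same ref, so all of them show the last position.
echo kept[0].start, " ", kept[1].start, " ", kept[2].start
echo fresh[0].start, " ", fresh[1].start, " ", fresh[2].start
```

This is the trade-off behind the fast and slow iterators: the reusing version is cheaper, while the allocating version yields values that remain valid after the loop moves on.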
1.5.3. On the GPU
On the GPU side, the definitions are similar:
type
  CudaVector*[A] = object
    data*: ref[ptr A]
    fp*: ptr A
    len, step*: int32
  CudaMatrix*[A] = object
    M*, N*, ld*: int32
    data*: ref[ptr A]
    fp*: ptr A
    shape*: set[MatrixShape]
The main difference here is that one cannot store the underlying data in a sequence, because data is allocated on a device, and the CUDA API returns the relevant pointers, over which we have no control.
To have a similar approach to the former case, the CUDA pointer to the data is wrapped inside a ref. The actual pointer used in computation is, again, fp, while data is shared for slices, or rows and columns of a matrix. When the last reference to data goes away, a finalizer calls the CUDA routines to clean up the allocated memory.
Also, CUDA matrices are only column major for now, although this is going to change in the future.
1.6. Linking external libraries
1.6.1. Linking BLAS and LAPACK implementations
Neo needs to be linked against a BLAS and a LAPACK implementation to perform the actual linear algebra operations. By default, it tries to link whatever the default system-wide implementations are.
You can link against different implementations by a combination of:
- changing the path for linked libraries (use --clibdir for this)
- using the --define:blas flag. By default, the system tries to load a BLAS library called blas, which translates into something called blas.dll or libblas.so according to the underlying operating system. To link, say, the library libopenblas.so.3 on Linux, you should pass to Nim the option --define:blas=openblas.
- using the --define:lapack flag. By default, the system tries to load a LAPACK library called lapack, which translates into something called lapack.dll or liblapack.so according to the underlying operating system. To link, say, the library libopenblas.so.3 on Linux, you should pass to Nim the option --define:lapack=openblas.
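Instead of passing these options on the command line every time, they can be recorded in a NimScript configuration file. The following is a sketch that assumes a file named config.nims placed next to the main module of your project, and that OpenBLAS provides both BLAS and LAPACK on your system; adjust the library names to your setup.

```nim
# config.nims (NimScript), assumed to sit next to the main module.
# Equivalent to passing --define:blas=openblas --define:lapack=openblas.
switch("define", "blas=openblas")
switch("define", "lapack=openblas")
```
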
See the tasks inside neo.nimble for a few examples.
Packages for various BLAS or LAPACK implementations are available from the package managers of many Linux distributions. On OSX one can add the brew formulas from Homebrew Science, such as brew install homebrew/science/openblas.
You may also need to add suitable paths for the includes and library dirs. On OSX, this should do the trick
switch("clibdir", "/usr/local/opt/openblas/lib")
switch("cincludes", "/usr/local/opt/openblas/include")
If you have problems with MKL, you may want to link it statically. Just pass the options
--dynlibOverride:mkl_intel_lp64
--passL:${PATH_TO_MKL}/libmkl_intel_lp64.a
to enable static linking.
On Windows, it is recommended to use MSYS2 to install the mingw compiler toolchain and compatible OpenBLAS library. For 64-bit builds, this would be:
pacman -S mingw-w64-x86_64-gcc mingw-w64-x86_64-openblas
You should then add MSYS2_ROOT\mingw64\bin to your PATH. Programs using nimblas can then be compiled using the -d:blas=libopenblas switch. At runtime, libopenblas.dll should be loaded from the mingw64 bin directory you added to your PATH, though it is suggested to distribute this DLL file alongside your executable if you are publishing binary packages.
1.6.2. Linking CUDA
It is possible to delegate work to the GPU using CUDA. The library has been tested to work with NVIDIA CUDA 8.0, but it is possible that earlier versions will work as well. In order to compile and link against CUDA, you should make the appropriate headers and libraries available. If they are not globally set, you can pass suitable options to the Nim compiler, such as
--cincludes:"/usr/local/cuda/include"
--clibdir:"/usr/local/cuda/lib64"
Support for CUDA is under the package neo/cuda, which needs to be imported explicitly.
1.7. TODO
See the issue list
1.8. Contributing
Every contribution is very much appreciated! This can range from:
- using the library and reporting any issues and any configuration on which it works fine
- building other parts of the scientific environment on top of it
- writing blog posts and tutorials
- helping with the documentation
- contributing actual code (see the issue list section)