*токама́к* ("tokamak" in Russian)
Tokamak is an optimising compiler for index-style array expressions. Its main reason for existence is that I'm too lazy to write a several-hundred-line CUDA kernel for every op I need (matmul, softmax, etc.); with Tokamak I can describe each op in one line and compile it to fast CPU and GPU kernels. Possible future directions:
- Smarter tiling and other locality optimisations
- Emit BLAS-equivalent kernels without special cases
- Compile larger programs (e.g. entire ML models) to GPUs
- Optimise distributed cluster computation, e.g. JuliaDB
Take a couple of familiar examples:
```julia
@tk function add(A, B)
    C[i] = A[i] + B[i]
end

@tk sum(xs) = reduce(+, 0, xs)

@tk function matmul(A, B)
    C[i, j] = sum([k] -> A[i,k]*B[k,j])
end
```
This is extremely close to traditional mathematical notation. Under the hood, we view arrays as being functions of their indices (hence the "anonymous array" syntax `[...] -> ...`). If a Tokamak function returns an array, the generated code will evaluate that array at every point.
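To make that model concrete, here's the same idea written out in plain Julia (an illustration of the semantics, not Tokamak's internals; `index_add` and `materialise` are our names):

```julia
# "Anonymous array" view: C[i] = A[i] + B[i], as a function of the index i.
index_add(A, B) = i -> A[i] + B[i]

# Materialising an index function means evaluating it at every point.
materialise(f, n) = [f(i) for i in 1:n]

A, B = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
materialise(index_add(A, B), length(A))  # [11.0, 22.0, 33.0]
```

For `add` above, the generated CPU code performs exactly this evaluation as a loop: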
```julia
julia> cpu(add) |> prettify
function (out, goat, mink)
    for elephant = 1:size(goat, 1)
        out[elephant] = goat[elephant] + mink[elephant]
    end
    return out
end
```
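(The animal names are just machine-generated fresh variables, prettified for display.) For reference, here is the same generated loop with conventional names; this is all the compiled kernel does:

```julia
# The generated add kernel above, renamed by hand for readability.
function add!(out, A, B)
    for i in 1:size(A, 1)
        out[i] = A[i] + B[i]
    end
    return out
end
```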
Tokamak infers shapes from the description:
```julia
julia> infer(add)
(m) → (m) → (m)

julia> infer(matmul)
(m, n) → (n, o) → (m, o)
```
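Shape inference falls out of the index view: arrays indexed by the same variable must agree on that dimension. Here's a toy sketch of the idea in plain Julia (illustrative only, not Tokamak's actual algorithm):

```julia
# Each index variable becomes a size variable (m, n, o, ...); sharing an
# index between two arrays unifies their sizes along that dimension.
function toy_infer(args, out)
    sizes = Dict{Symbol,Char}()
    name(ix)   = get!(sizes, ix, 'm' + length(sizes))
    shape(ixs) = "(" * join([name(ix) for ix in ixs], ", ") * ")"
    join([shape(ixs) for (_, ixs) in args], " → ") * " → " * shape(out)
end

toy_infer([:A => [:i, :k], :B => [:k, :j]], [:i, :j])  # "(m, n) → (n, o) → (m, o)"
toy_infer([:A => [:i, :i]], Symbol[])                  # "(m, m) → ()"
```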
Functions can return scalars as well, for example (using a short-form syntax):
```julia
julia> @tk diag(A)[i] = A[i,i]

julia> @tk trace(A) = sum(diag(A))

julia> infer(trace)
(m, m) → ()

julia> cpu(trace) |> prettify
function (anteater,)
    let
        sum = 0
        for gaur = 1:size(anteater, 1)
            sum += anteater[gaur, gaur]
        end
        sum
    end
end
```
Notice that we do not construct the `diag` array.
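To see what that saves, here is the same computation written out by hand in plain Julia (an illustration; `naive_trace` and `fused_trace` are our names): the naive version allocates the diagonal as a vector before summing, while the fused loop above never does.

```julia
using LinearAlgebra

naive_trace(A) = sum(diag(A))                        # allocates diag(A) first
fused_trace(A) = sum(A[i, i] for i in 1:size(A, 1))  # what the generated code does

A = rand(4, 4)
naive_trace(A) ≈ fused_trace(A)  # true
```

This applies equally well to more complex examples: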
```julia
julia> @tk tracemul(A,B) = sum(diag(mul(A,B)))

julia> infer(tracemul)
(m, n) → (n, m) → ()

julia> cpu(tracemul) |> prettify
function (duck, leopard)
    let
        sum = 0.0
        for goosander = 1:size(duck, 1)
            sum += let
                sum = 0.0
                for horse = 1:size(duck, 2)
                    sum += duck[goosander, horse] * leopard[horse, goosander]
                end
                sum
            end
        end
        sum
    end
end
```
Crucially, we only calculate the elements of the matrix multiply that we actually need, namely the diagonal. Despite the relative naivety of the generated code, this is enough to get a solid 10x speedup over the equivalent BLAS calls, which have to materialise the full product.
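In plain Julia, the two strategies look like this (an illustrative sketch; the helper names are ours, and timings will vary by machine and size):

```julia
using LinearAlgebra

blas_tracemul(A, B)  = tr(A * B)  # computes all m×m entries of A*B
fused_tracemul(A, B) =            # computes only the m diagonal entries
    sum(dot(view(A, i, :), view(B, :, i)) for i in 1:size(A, 1))

A, B = rand(512, 256), rand(256, 512)
blas_tracemul(A, B) ≈ fused_tracemul(A, B)  # true
```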
## GPU Example
Tokamak supports (extremely early-stage, totally untested) GPU compilation. Consider defining enough to do a single layer of an MLP: `tanh(x*W + b)`.
```julia
@tk add(a, b)[i, j] = a[i, j] + b[i, j]
@tk mul(A, B)[i, j] = sum([k] -> A[i,k]*B[k,j])
@tk act(a)[i, j]    = tanh(a[i, j])
@tk net(W, b, x)    = act(add(mul(x, W), b))

infer(net) # (m, n) → (o, n) → (o, m) → (o, n)

gpu(net) |> prettify
```
```julia
function (out, waterbuffalo, sealion, otter)
    (gnat, oyster) = ((blockIdx().x - 1) * blockDim().x + threadIdx().x,
                      (blockIdx().y - 1) * blockDim().y + threadIdx().y)
    out[gnat, oyster] = tanh(begin
        snail = 0
        for pig = 1:size(otter, 2)
            snail = snail + otter[gnat, pig] * waterbuffalo[pig, oyster]
        end
        snail
    end + sealion[gnat, oyster])
    return out
end
```
So here we have a single fused kernel that performs the entire layer in one pass.
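Launching a kernel of this shape by hand would look something like the sketch below, using CUDA.jl's `@cuda` (this illustrates the launch plumbing only, not Tokamak's API; the bounds check is added because thread counts round up to the block size, and the sizes are arbitrary):

```julia
using CUDA

# Hand-written equivalent of the generated kernel above.
function net_kernel(out, W, b, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(out, 1) && j <= size(out, 2)
        acc = zero(eltype(out))
        for k in 1:size(x, 2)                 # mul: contract over k
            acc += x[i, k] * W[k, j]
        end
        out[i, j] = tanh(acc + b[i, j])       # add + act, fused
    end
    return nothing
end

x, W = CUDA.rand(64, 128), CUDA.rand(128, 32)
b, out = CUDA.rand(64, 32), CUDA.zeros(64, 32)
threads = (16, 16)
blocks  = cld.(size(out), threads)
@cuda threads=threads blocks=blocks net_kernel(out, W, b, x)
```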
See the tests for more detailed examples.