Home

Awesome

<div align="center"> <h1>Tullio.jl</h1>

GitHub
CI Buildkite GPU CI Tag Version

</div>

Tullio is a very flexible einsum macro. It understands many array operations written in index notation -- not just matrix multiplication and permutations, but also convolutions, stencils, scatter/gather, and broadcasting. For example:

@tullio M[x,y,c] := N[x+i, y+j,c] * K[i,j]     # sum over i,j, and create M

@tullio S[x] = P[x,y] * log(Q[x,y] / R[y])     # sum over y, and write into S

@tullio A[i,j] += B[i,k,l] * C[l,j] * D[k,j]   # sum over k,l, and add to values in A

@tullio (*) Z[j] := X[ind[k],j] * exp(-Y[k])   # product over k

Used by itself the macro writes ordinary nested loops much like Einsum.@einsum. One difference is that it can parse more expressions, and infer ranges for their indices. Another is that it will use multi-threading (via Threads.@spawn) and recursive tiling, on large enough arrays. But it also co-operates with various other packages, provided they are loaded before the macro is called:

The macro also tries to provide a gradient for use with Tracker or (via ChainRules) for Zygote, Yota, etc. <!-- or [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl). --> (Disable with grad=false, or nograd=A.) This is done in one of two ways:

The entire right hand side is summed over the full possible range of any indices not appearing on the left. Pipe operators |> or <| indicate functions to be performed outside the sum, for example:

@tullio lse[j] := log <| exp(mat[i,j])   # vec(log.(sum(exp.(mat), dims=1)))

The option @tullio verbose=true will cause it to print index ranges, symbolic derivatives, and notices when it is unable to use the packages mentioned above. And verbose=2 will print everything.

If it's useful in academic work, you can cite it with this DOI: DOI

Notation

<details>

Index notation for some simple functions:

using Pkg; Pkg.add("Tullio")
using Tullio, Test
M = rand(1:20, 3, 7)

@tullio S[1,c] := M[r,c]  # sum over r ∈ 1:3, for each c ∈ 1:7
@test S == sum(M, dims=1) 

@tullio Q[ρ,c] := M[ρ,c] + sqrt(S[1,c])  # loop over ρ & c, no sum -- broadcasting
@test Q ≈ M .+ sqrt.(S)

mult(M,Q) = @tullio P[x,y] := M[x,c] * Q[y,c]  # sum over c ∈ 1:7 -- matrix multiplication
@test mult(M,Q) ≈ M * transpose(Q)

R = [rand(Int8, 3, 4) for δ in 1:5]

@tullio T[j,i,δ] := R[δ][i,j] + 10im  # three nested loops -- concatenation
@test T == permutedims(cat(R...; dims=3), (2,1,3)) .+ 10im

@tullio (max) X[i] := abs2(T[j,i,δ])  # reduce using max, over j and δ
@test X == dropdims(maximum(abs2, T, dims=(1,3)), dims=(1,3))

dbl!(M, S) = @tullio M[r,c] = 2 * S[1,c]  # write into existing matrix, M .= 2 .* S
dbl!(M, S)
@test all(M[r,c] == 2*S[1,c] for r ∈ 1:3, c ∈ 1:7)

More complicated examples:

using Tullio
A = [abs2(i - 11) for i in 1:21]

# Downsample -- range of i is that allowed by both terms:
@tullio B[i] := (A[2i] + A[2i+1])/2  # 1:10 == intersect(1:10, 0:10)

# Shifts -- range of i calculated in terms of that given for j:
@tullio M[i,j] := A[i+j-1]  (j in 1:15)  # i in 1:7
@tullio M[i+_,j] := A[i+j]  (j in 1:15)  # i in 0:6, automatic shift "i+_"

using OffsetArrays # Convolve a filter:
K = OffsetArray([1,-1,2,-1,1], -2:2)
@tullio C[i] := A[i+j] * K[j]    # j ∈ -2:2 implies i ∈ 3:19

# Index by the values in K
@tullio D[i,j] := A[2K[j]+i] ÷ K[j] # extrema(K)==(-1,2) implies i ∈ 3:17

# Wrapped & padded:
@tullio M[i,j] := A[mod(i+j)]  (j in 1:15, i in 1:15)   # wraps around, mod(i+j, axes(A,1))
@tullio M[i,j] := A[clamp(i+j)]  (j in 1:15, i in 1:15) # instead repeats "100"
@tullio M[i+_,j] := A[pad(i+j, 3)]  (j in 1:15)         # fills with zeros

using FFTW # Functions of the indices are OK:
S = [0,1,0,0, 0,0,0,0]
fft(S) ≈ @tullio F[k] := S[x] * exp(-im*pi/8 * (k-1) * x)  (k ∈ axes(S,1))

# Finalisers <| or |> are applied after sum (the two are equivalent):
@tullio N2[j] := sqrt <| M[i,j]^2     # N2 ≈ map(norm, eachcol(M))
@tullio n3[_] := A[i]^3  |> (_)^(1/3) # n3[1] ≈ norm(A,3), with _ anon. func.

# Reduction over any function:
@tullio (*) P[i] := A[i+k]  (k in 0:2) # product
@tullio (max) X[i,_] := D[i,j]         # maximum(D, dims=2), almost

min1(x,y) = ifelse(first(x) < first(y), x, y); # findmin(D, dims=1), almost:
@tullio (min1) Ts[j+_] := (D[i,j], (i,j))  init=(typemax(Int), (0,0))

# Access to fields & arrays -- this uses j ∈ eachindex(first(N).c)
N = [(a=i, b=i^2, c=fill(i^3,3)) for i in 1:10]
@tullio T[i,j] := (N[i].a // 1, N[i].c[j])

# Functions which create arrays are evaluated once:
@tullio R[i,j] := abs.((rand(Int8, 5)[i], rand(Int8, 5)[j]))

using NamedDims, AxisKeys # Dimension names, plus pretty printing:
@tullio M[row=i, col=j, z=k] := A[i+j-1]  (j in 1:15, k in 1:2)
@tullio S[i] := M[col=j-i, z=k, row=i+1] # sum over j,k
</details>

Fast & Slow

<details>

When used with LoopVectorization, on straightforward matrix multiplication of real numbers, @tullio tends to be about as fast as OpenBLAS. Depending on the size, and on your computer. Here's a speed comparison on mine: v2.5.

This race is a useful diagnostic, but isn't really the goal. There is little point in avoiding using BLAS libraries, if you want precisely what they are optimised to give you. One of the things @tullio is often very fast at is weird tensor contractions, for which you would otherwise need permutedims:

using Tullio, LoopVectorization, NNlib, BenchmarkTools

# Batched matmul, with batch index first in B:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j]  # (sum over j)

A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1)))  # true

@btime bmm_rev($A, $B);  # 317.526 μs μs, same speed as un-permuted
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1)));  # 1.478 ms, with MKL

Complex numbers aren't handled by LoopVectorization, so will be much slower.

Chained multiplication is also very slow, because it doesn't know there's a better algorithm. Here it just makes 4 loops, instead of multiplying sequentially, 30^4 instead of 2 * 30^3 operations:

M1, M2, M3 = randn(30,30), randn(30,30), randn(30,30);
@btime $M1 * $M2 * $M3;                                   #  3.525 μs
@btime @tullio M4[i,l] := $M1[i,j] * $M2[j,k] * $M3[k,l]; # 30.401 μs

Or slightly less obviously:

M, Σ = randn(100,100), randn(100,100);
@tullio R4[i, j] := (M[μ, i] - M[μ,j])' * Σ[μ,ν] * (M[ν, i] - M[ν, j]);
begin
  S = M' * Σ * M  # two N^3 operations, instead of one N^4
  @tullio R3[i,j] := S[i,i] + S[j,j] - S[i,j] - S[j,i]
end;
R3 ≈ R4

Another thing Tullio can be very fast at is broadcast reductions, where it can avoid large allocations. Here LoopVectorization is speeding up log, and Tullio is handling tiled memory access and multi-threading:

sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])
sum_part(X, Y=X) = @tullio S[i] := X[i,j] * log(Y[j,i])

X = rand(1000,1000);
@btime sum_opp($X)                    #   499.814 μs (93 allocations: 3.97 KiB)
@btime sum($X .* log.(transpose($X))) # 8.759 ms (2 allocations: 7.63 MiB)

@btime sum_part($X)'                           #  1.599 ms (not the same computer!)
@btime sum($X .* log.(transpose($X)), dims=2)  # 13.292 ms

At present indices using pad, clamp or mod are also slow. These result in extra checks or operations at every iteration, not just around the edges:

conv1(x,k) = @tullio y[i+_, j+_] := x[i+a, j+b] * k[a,b]
conv2(x,k) = @tullio y[i+_, j+_] := x[2i-a, 2j-b] * k[a,b]
conv3(x,k) = @tullio y[i+_, j+_] := x[pad(i-a,3), pad(j-b,3)] * k[a,b]

x100 = rand(100,100); k7 = randn(7,7);
@btime conv1($x100, $k7); #  25.574 μs
@btime conv2($x100, $k7); #  44.590 μs
@btime conv3($x100, $k7); #  86.228 μs

using Flux
x104 = reshape(x100,(100,100,1,1)); k74 = reshape(k7,(7,7,1,1)); 
conv1(x100, k7) ≈ @btime CrossCor($k74, false)($x104)       # 586.694 μs
conv2(x100, k7) ≈ @btime Conv($k74, false, stride=2)($x104) # 901.573 μs
conv3(x100, k7) ≈ @btime Conv($k74, false, pad=3)($x104)    # 932.658 μs

using DSP
@btime DSP.conv($x100, $k7); # 198.331 μs
</details>

Gradients & GPU

<details><summary><b>Derivatives & GPU</b></summary>
using Tullio
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

A = rand(3,40); B = rand(40,500);
A * B ≈ mul(A, B) # true

using Tracker # or Zygote
ΔA = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
ΔA ≈ ones(3,500) * B' # true

using CUDA, KernelAbstractions # Now defined with a GPU version:
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

cu(A * B) ≈ mul(cu(A), cu(B)) # true

cu(ΔA) ≈ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1] # true

# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]

Some warnings are in order:

</details>

Larger Expressions

<details>

The expression need not be just one line, for example:

@tullio out[x, y] := @inbounds(begin  # sum over k
        a,b = off[k]
        mat[mod(x+a), mod(y+b)]
    end) (x in axes(mat,1), y in axes(mat,2)) grad=Dual nograd=off

Here the macro cannot infer the range of the output's indices x,y, so they must be provided explicitly. (If writing into an existing array, with out[x,y] = begin ... or +=, then ranges would be taken from there.) Because it sees assignment being made, it does not attempt to sum over a,b, and it assumes that indices could go out of bounds so does not add @inbounds for you. (Although in fact mod(x+a) == mod(x+a, axes(mat,1)) is safe.) It will also not be able to take a symbolic derivative, but dual numbers will work fine.

More examples:

using Tullio, OffsetArrays

# A convolution with cyclic indices
mat = zeros(10,10,1); mat[2,2] = 101; mat[10,10] = 1;
@tullio kern[i,j] := 1/(1+i^2+j^2)  (i in -3:3, j in -3:3)

@tullio out[x,y,c] := begin
    xi = mod(x+i, axes(mat,1)) # xi = ... means that it won't be summed,
    # yj = mod(y+j, axes(mat,2))
    @inbounds trunc(Int, mat[xi, mod(y+j), c] * kern[i,j]) # and disables automatic @inbounds,
end (x in 1:10, y in 1:10) # and prevents range of x from being inferred.

# A stencil?
offsets = [(a,b) for a in -2:2 for b in -2:2 if a>=b] # vector of tuples

@tullio out[x,y,1] = begin
        a,b = offsets[k]
        i = clamp(x+a, extrema(axes(mat,1))...)
        # j = clamp(y+b, extrema(axes(mat,2))...) # can be written clamp(y+b)
        @inbounds mat[i, clamp(y+b), 1] * 10
    end # ranges of x,y read from out[x,y,1]

# Applying a vector of functions
fs = [sin, cos, tan]
xs = randn(3,100)
@tullio ys[r,c] := (fs[r])(xs[r,c])

using Zygote, ForwardDiff
rowmap(fs, xs) = @tullio ys[r,c] := (fs[r])(xs[r,c]) grad=Dual nograd=fs
Zygote.gradient(sum∘rowmap, fs, ones(3,2))
[f'(1) for f in fs] # agrees
</details>

Keyword Options

<details>

The default setting is: @tullio threads=true fastmath=true avx=true tensor=true cuda=256 grad=Base verbose=false A[i,j] := ...

Implicit:

Extras:

Macros:

</details>

How it Works

<details>

The following three macros all end up calling the same functions as does C = A * B:

@tensor C[i,j] := A[i,k] * B[k,j]         # TensorOperations.jl
@ein C[i,j] := A[i,k] * B[k,j]            # OMEinsum.jl
@matmul C[i,j] := sum(k) A[i,k] * B[k,j]  # TensorCast.jl

But this one writes its own for-loops:

@einsum C[i,j] := A[i,k] * B[k,j]         # Einsum.jl

expanding out to roughly this:

T = promote_type(eltype(A), eltype(B))
C = Array{T}(undef, size(A,1), size(B,2))
@inbounds for j in 1:size(B,2)
    for i in 1:size(A,1)
        acc = zero(T)
        for k in 1:size(A,2)
            acc += A[i,k] * B[k,j]
        end
        C[i,j] = acc
    end
end

Tullio does something similar, but working through a few functions. Taking a slightly more complicated example, this:

@tullio C[i,j] := tanh <| A[i,k] * B[k,j]

expands to roughly this:

function act!(::Type, C::AbstractArray{T}, A, B, ax_i, ax_j, ax_k, keep=nothing, final=true) where T
    @inbounds @fastmath for i in ax_i
        for j in ax_j
            acc = isnothing(keep) ? zero(T) : C[i,j]
            for k in ax_k
                acc += A[i,k] * B[k,j]
            end
            C[i,j] = isnothing(final) ? acc : tanh(acc)
        end
    end
end

function make(A, B)
    ax_i = axes(A,1)
    ax_j = axes(B,2)
    ax_k = axes(A,2) # and check this is == axes(B,1)
    rhs(A,B,i,j,k) = tanh(A[i,k] * B[k,j])
    T = Core.Compiler.return_type(rhs, eltype.((A,B,1,1,1))) # plus a fallback
    C = similar(A, T, (ax_i, ax_j))
    Tullio.threader(act!, Array{T}, C, (A,B), (ax_i,ax_j), (ax_k,), +, 64^3)
    return C
end

C = Tullio.Eval(make, ∇make)(A, B)

This division allows it to dispatch to other methods of act!: one generated with @avx if LoopVectorization is loaded, and one for ::CuArray if KernelAbstractions is loaded.

It also allows threader to divide the work, calling act! many times, from different threads, on small tiles made by dividing the longest axis (say ax_i) in half, repeatedly. If it divides up ax_k, these are done sequentially, with keep=true on all ranges except the first, and final=nothing on all except the last. But ax_i and ax_j are safe to do in parallel.

Finally, Eval exists to give Zygote and friends somewhere to attach themselves. The gradient calculation looks roughly like this:

@adjoint function (e::Eval)(AB...)
    C = e.fwd(AB...)
    C, ΔC -> e.rev(ΔC, C, AB...)
end

function ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
    for k in ax_k, i in ax_i, j in ax_j
        ex = ΔC[i,j] * (1-C[i,j])^2
        ΔA[i,k] += ex * B[k,j]
        ΔB[k,j] += A[i,k] * ex
    end
end

function ∇make(ΔC, C, A, B)
    ΔA = similar(A) .= 0
    ΔB = similar(B) .= 0
    ax_i, ax_k = axes(A); ax_j = axes(B,2)
    Tullio.∇threader(∇act!, Array{T}, (ax_k,), (ax_i, ax_j), nothing)
    return (ΔA, ΔB)
end

In this case, it is the loop over k which can be safely broken among different threads, since both ΔA and ΔB have this index. Both ΔA and ΔB are filled in at once.

Notice that the derivative of y = tanh(z) has been written in terms of y (the final result of the forward pass) but free of z (the result of the sum, which was not saved). If this is not possible, it will fail.

If using grad=Dual, the gradient kernel looks different. This method cannot handle finalisers like tanh above, but for plain @tullio C[i,j] := A[i,k] * B[k,j] it would read:

function ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
    eps1 = ForwardDiff.Dual(0, (1,0))
    eps2 = ForwardDiff.Dual(0, (0,1))
    for k in ax_k, i in ax_i, j in ax_j
        res = (A[i,k] + eps1) * (B[k,j] + eps2)
        ΔA[i,k] += ForwardDiff.partials(res, 1) * ΔC[i,j]
        ΔB[k,j] += ForwardDiff.partials(res, 2) * ΔC[i,j]
    end
end

Writing @tullio verbose=2 will print all of these functions out.

Scalar reductions, such as @tullio s := A[i,j] * log(B[j,i]), are slightly different in that the act! function simply returns the sum, i.e. the variable acc above.

</details>

Elsewhere

Back-end friends & relatives:

Front-end near-lookalikes:

Things you can't run: