Fast Hadamard Transform in CUDA, with a PyTorch interface

Features:
- Supports fp16, bf16, and fp32, for dimensions up to 32768.
- Implicitly pads with zeros if the dimension is not a power of 2.

How to use

from fast_hadamard_transform import hadamard_transform
def hadamard_transform(x, scale=1.0):
    """
    Arguments:
        x: (..., dim)
        scale: float. Multiply the output by this number.
    Returns:
        out: (..., dim)

    Multiply each row of x by the Hadamard transform matrix.
    Equivalent to F.linear(x, torch.tensor(scipy.linalg.hadamard(dim))) * scale.
    If dim is not a power of 2, we implicitly pad x with zeros so that dim becomes the next power of 2.
    """

Speed

Benchmarked on an A100, for batch sizes that are not too small, against a pure memcpy (torch.clone). The memcpy time is a lower bound, since any implementation must at least read the input from GPU memory and write the output back. A minimal timing sketch follows the table.

Data type   Dimension    Time taken vs memcpy
fp16/bf16   <= 512       1.0x
            512 - 8192   <= 1.2x
            16384        1.3x
            32768        1.8x
fp32        <= 8192      1.0x
            16384        1.1x
            32768        1.2x
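The sketch below is one way to reproduce this kind of comparison. It is an illustration using standard CUDA-event timing, not the script behind the table, and the tensor shape is an arbitrary example.

import torch
from fast_hadamard_transform import hadamard_transform

def time_fn(fn, *args, n_warmup=10, n_repeat=100):
    # Warm up, then time with CUDA events to exclude host-side launch noise
    for _ in range(n_warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_repeat):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_repeat  # average ms per call

x = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")
t_transform = time_fn(hadamard_transform, x)
t_memcpy = time_fn(torch.clone, x)  # the lower bound described above
print(f"{t_transform:.3f} ms vs {t_memcpy:.3f} ms memcpy "
      f"({t_transform / t_memcpy:.2f}x)")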