# Fast Hadamard Transform in CUDA, with a PyTorch interface
Features:
- Supports fp32, fp16, and bf16, for dimensions up to 32768.
- Implicitly pads with zeros if the dimension is not a power of 2.
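
For reference, the transform itself can be written as a simple O(dim log dim) butterfly over the last axis. Below is a minimal pure-PyTorch sketch of that recursion, useful for checking results on small power-of-2 sizes; it is illustrative only and is not the library's CUDA kernel.

```python
import torch

def fwht_reference(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized Walsh-Hadamard transform along the last axis.
    Illustrative reference only; the last dimension must be a power of 2."""
    dim = x.shape[-1]
    assert dim & (dim - 1) == 0, "dim must be a power of 2"
    out = x.clone()
    h = 1
    while h < dim:
        # Pair up blocks of size h and apply the (a + b, a - b) butterfly.
        out = out.reshape(*x.shape[:-1], dim // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], dim)
        h *= 2
    return out
```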
## How to use
```python
from fast_hadamard_transform import hadamard_transform
```

```python
def hadamard_transform(x, scale=1.0):
    """
    Arguments:
        x: (..., dim)
        scale: float. Multiply the output by this number.
    Returns:
        out: (..., dim)

    Multiply each row of x by the Hadamard transform matrix.
    Equivalent to F.linear(x, torch.tensor(scipy.linalg.hadamard(dim))) * scale.
    If dim is not a power of 2, we implicitly pad x with zeros so that dim is the next power of 2.
    """
```
## Speed
Benchmarked on an A100, for moderately large batch sizes, against memcpy (`torch.clone`). Memcpy is a lower bound on the achievable time, since we would need to read the input from GPU memory and write the output back anyway. A sketch of how to reproduce this comparison appears after the table.
| Data type | Dimension  | Time taken vs memcpy |
|-----------|------------|----------------------|
| fp16/bf16 | <= 512     | 1.0x                 |
| fp16/bf16 | 512 - 8192 | <= 1.2x              |
| fp16/bf16 | 16384      | 1.3x                 |
| fp16/bf16 | 32768      | 1.8x                 |
| fp32      | <= 8192    | 1.0x                 |
| fp32      | 16384      | 1.1x                 |
| fp32      | 32768      | 1.2x                 |
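
A minimal sketch of this measurement using `torch.utils.benchmark` (the batch size and dimension here are illustrative, not the exact configurations benchmarked above):

```python
import torch
from torch.utils import benchmark
from fast_hadamard_transform import hadamard_transform

x = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")

t_fht = benchmark.Timer(
    stmt="hadamard_transform(x)",
    globals={"hadamard_transform": hadamard_transform, "x": x},
).blocked_autorange()
# torch.clone is the memcpy baseline: read the input, write the output.
t_clone = benchmark.Timer(stmt="x.clone()", globals={"x": x}).blocked_autorange()
print(f"time vs memcpy: {t_fht.median / t_clone.median:.2f}x")
```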