# KDEformer: Accelerating Transformers via Kernel Density Estimation
KDEformer approximates attention in sub-quadratic time with provable spectral norm bounds. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.
## How the algorithm works
KDEformer leverages the body of research on efficient kernel density estimation (KDE). In the KDE problem, the goal is to estimate the kernel density $\mu_X(q)$ of an arbitrary query point $q$ up to relative error, in time proportional to $d$ divided by a lower bound on the kernel density $\mu_X(q)$.
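For concreteness, here is a minimal sketch (not the repository's code) of the quantity being estimated for the exponential kernel, which is closely related to the softmax normalization that appears in attention:

```python
import numpy as np

def exp_kernel_density(X, q):
    """Naive exponential-kernel density: mu_X(q) = (1/n) * sum_i exp(<x_i, q>).

    The brute-force sum below costs O(n * d) per query; fast KDE methods
    estimate the same quantity up to relative error far more cheaply."""
    return np.mean(np.exp(X @ q))

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64)) / 8.0   # n data points in R^d
q = rng.normal(size=64) / 8.0           # query point
print(exp_kernel_density(X, q))
```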
Our technique transforms the problem of finding the sampling matrix and diagonal scaling that satisfy the attention approximation into a generalization of the KDE problem.
We use an efficient KDE procedure for estimating the exponential kernel density to compute a scaling that satisfies the spectral guarantee of the attention approximation. We also design an efficient sampling matrix that satisfies the approximation with a small number of rows. The sampling probabilities need to be proportional to the column norms of the softmax matrix.
Having a generalized KDE procedure for efficiently evaluating the weighted exponential kernel density enables us to approximate the attention mechanism as per our approximation formula. This approach translates the original problem into a Gaussian KDE problem, which allows us to leverage the significant recent progress in this area.
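The following is a simplified, self-contained sketch of this sampling-and-scaling idea, assuming exact column norms and exact row sums where KDEformer would substitute fast KDE estimates (function and variable names are illustrative, not the repository's API):

```python
import numpy as np

def sampled_softmax_attention(Q, K, V, m, rng=np.random.default_rng(0)):
    """Toy attention approximation by column sampling.

    With A = exp(Q K^T) and D = diag(A 1), attention is D^{-1} A V. Here we
    sample m key/value rows with probability proportional to the column norms
    of A and rescale, instead of using all n rows. KDEformer replaces the
    exact norms and row sums below with fast KDE estimates."""
    A = np.exp(Q @ K.T)                       # softmax numerator, shape (n, n)
    d = A.sum(axis=1)                         # row sums (diagonal scaling)
    col_norms = np.linalg.norm(A, axis=0)     # column norms of the softmax matrix
    p = col_norms / col_norms.sum()           # sampling distribution over keys
    idx = rng.choice(len(p), size=m, p=p)     # sample m columns (keys/values)
    A_s = A[:, idx] / (m * p[idx])            # importance-weighted sampled columns
    return (A_s @ V[idx]) / d[:, None]

# toy usage
n, dim, m = 256, 32, 64
rng = np.random.default_rng(1)
Q = rng.normal(size=(n, dim)) / np.sqrt(dim)
K = rng.normal(size=(n, dim)) / np.sqrt(dim)
V = rng.normal(size=(n, dim))
approx = sampled_softmax_attention(Q, K, V, m)
exact = (np.exp(Q @ K.T) / np.exp(Q @ K.T).sum(axis=1, keepdims=True)) @ V
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

Because each sampled column is reweighted by $1/(m\,p_j)$, the sampled product is an unbiased estimator of $AV$, and sampling proportionally to the column norms keeps its variance small.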
In the paper, we present a novel Locality-Sensitive Hashing (LSH) algorithm tailored for GPU usage; a detailed description is given in the main body of the paper. At a high level, the algorithm first partitions the space using cosine LSH, producing many buckets of varying sizes, most of which are small.
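As a rough illustration of the bucketing step, here is a sketch of standard cosine (hyperplane) LSH rather than the repository's GPU kernel: each point is hashed to the sign pattern of its projections onto random hyperplanes, and points that share a code fall into the same bucket.

```python
import numpy as np
from collections import defaultdict

def cosine_lsh_buckets(X, num_bits, seed=0):
    """Hash rows of X into buckets via sign-of-random-projection (cosine LSH).

    Each point gets a num_bits binary code; points with the same code share a
    bucket. Bucket sizes are typically uneven, with many small buckets."""
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(X.shape[1], num_bits))   # random hyperplane normals
    codes = (X @ H > 0).astype(np.uint8)          # sign pattern per point
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[tuple(code)].append(i)
    return codes, buckets

# toy usage
X = np.random.default_rng(1).normal(size=(1000, 64))
codes, buckets = cosine_lsh_buckets(X, num_bits=8)
print(len(buckets), "buckets; largest sizes:", sorted(map(len, buckets.values()))[-5:])
```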
<p align="center"> <img width="436" alt="Screen Shot 1401-11-13 at 14 30 31" src="https://user-images.githubusercontent.com/112280828/216431091-3b69481b-14c3-4909-acec-26503ee142f0.png"> </p>Following this, the algorithm arranges all the buckets in an order dictated by their Hamming distance, ensuring that neighboring elements have a Hamming distance of less than one. This arrangement ensures that elements bearing strong similarities are positioned adjacent to each other. Post-sorting, the elements are distributed across several blocks. Diagonal blocks are then isolated and multiplied as separate units.
<p align="center"> <img width="436" alt="Screen Shot 1401-11-13 at 14 30 50" src="https://user-images.githubusercontent.com/112280828/216432181-c6e52de9-59ce-4e7a-9e68-221a0a82b670.png"> </p>Experiments
- The code includes two experiments: (1) attention approximation on the GloVe dataset and (2) BigGAN image generation.
- The proposed algorithm is implemented in `KDEformer.py`.
- To run ScatterBrain, install the package from https://github.com/HazyResearch/fly.
- For the GloVe experiment, download and prepare the dataset:

  ```
  wget https://nlp.stanford.edu/data/glove.twitter.27B.zip
  unzip glove.twitter.27B.zip
  ```

  and run

  ```
  python demo_glove.py
  ```

- For BigGAN image generation, run

  ```
  python demo_biggan.py --attention METHOD
  ```

  where `METHOD` can be one of `exact` (default), `kdeformer`, `performer`, `reformer`, or `sblocal`. If ScatterBrain is not installed, the `sblocal` option will not run.