Hopfield Networks is All You Need

Hubert Ramsauer<sup>1</sup>, Bernhard Schäfl<sup>1</sup>, Johannes Lehner<sup>1</sup>, Philipp Seidl<sup>1</sup>, Michael Widrich<sup>1</sup>, Lukas Gruber<sup>1</sup>, Markus Holzleitner<sup>1</sup>, Milena Pavlović<sup>3, 4</sup>, Geir Kjetil Sandve<sup>4</sup>, Victor Greiff<sup>3</sup>, David Kreil<sup>2</sup>, Michael Kopp<sup>2</sup>, Günter Klambauer<sup>1</sup>, Johannes Brandstetter<sup>1</sup>, Sepp Hochreiter<sup>1, 2</sup>

<sup>1</sup> ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
<sup>2</sup> Institute of Advanced Research in Artificial Intelligence (IARAI)
<sup>3</sup> Department of Immunology, University of Oslo, Norway
<sup>4</sup> Department of Informatics, University of Oslo, Norway


A detailed blog post on this paper, as well as the necessary background on Hopfield networks, is available at this link.

The transformer and BERT models pushed the performance on NLP tasks to new levels via their attention mechanism. We show that this attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns must be traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update):

  1. global fixed point averaging over all patterns,
  2. metastable states averaging over a subset of patterns, and
  3. fixed points which store a single pattern.

Transformers learn an attention mechanism by constructing an embedding of patterns and queries into an associative space. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal in the regime of metastable states, is uniformly distributed when averaging globally, and vanishes when a fixed point is near a stored pattern. Based on the Hopfield network interpretation, we analyzed learning of transformer and BERT architectures. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging operations like the Gaussian weighting that we propose. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem a promising target for improving transformers. Neural networks that integrate Hopfield networks that are equivalent to attention heads outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns.
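
For concreteness, the update rule referred to above can be written in a few lines of PyTorch. The following is only a minimal sketch, not the implementation provided by this repository; the tensor names, the row-wise pattern layout, and the choice of beta = 1/sqrt(d) (the scaling used in transformer attention) are assumptions made for this example.

import torch

def hopfield_update(state_patterns, stored_patterns, beta=None):
    # state_patterns:  query patterns xi, shape (num_queries, d)
    # stored_patterns: patterns X,        shape (num_stored, d)
    d = stored_patterns.shape[-1]
    if beta is None:
        beta = 1.0 / (d ** 0.5)
    # softmax(beta * xi X^T) X: one update step of the continuous Hopfield network,
    # identical in form to attention with Q = xi and K = V = X
    weights = torch.softmax(beta * state_patterns @ stored_patterns.T, dim=-1)
    return weights @ stored_patterns

A single such step already retrieves a stored pattern or a metastable average over several patterns, which is why one update suffices.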

With this repository, we provide a PyTorch implementation of a new layer called “Hopfield”, which makes it possible to equip deep learning architectures with Hopfield networks as a new memory concept.

The full paper is available at https://arxiv.org/abs/2008.02217.

Requirements

The software was developed and tested on the following 64-bit operating systems:

The development environment was Python 3.8.3 in combination with PyTorch 1.6.0 (a PyTorch version of at least 1.5.0 should be sufficient). More details on how to install PyTorch are available on the official project page.

Installation

The recommended way to install the software is to use pip/pip3:

$ pip3 install git+https://github.com/ml-jku/hopfield-layers

To successfully run the Jupyter notebooks contained in examples, additional third-party modules are needed:

$ pip3 install -r examples/requirements.txt

The installation of the Jupyter software itself is not covered. More details on how to install Jupyter are available at the official installation page.

Usage

To get up and running with Hopfield-based networks, only <i>one</i> argument needs to be set: the size (depth) of the input.

from hflayers import Hopfield

hopfield = Hopfield(input_size=...)
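
A forward pass might then look as follows. This is a sketch only: the batch-first shape (batch size, number of patterns, input_size) and the self-association setting, in which the same tensor acts as stored patterns, state patterns and pattern projections, are assumptions; please consult the module docstrings for the exact interface.

import torch
from hflayers import Hopfield

hopfield = Hopfield(input_size=32)

# hypothetical toy batch: 4 sets of 10 patterns, each of size 32
patterns = torch.randn(4, 10, 32)

# self-association: the input serves as stored patterns, state patterns and pattern projections
result = hopfield(patterns)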

It is also possible to replace commonly used pooling functions with a Hopfield-based one. Internally, a <i>state pattern</i> is trained, which in turn is used to compute pooling weights with respect to the input.

from hflayers import HopfieldPooling

hopfield_pooling = HopfieldPooling(input_size=...)
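
Analogously, a sketch of the pooling variant under the same (assumed) batch-first shape:

import torch
from hflayers import HopfieldPooling

hopfield_pooling = HopfieldPooling(input_size=32)

# hypothetical toy batch: 4 sets of 10 patterns, each of size 32
patterns = torch.randn(4, 10, 32)

# the trained state pattern queries the input and pools over its patterns
pooled = hopfield_pooling(patterns)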

A second variant of our Hopfield-based modules employs a trainable yet input-independent lookup mechanism. Internally, one or multiple <i>stored patterns</i> and <i>pattern projections</i> are trained (optionally in a non-shared manner), which in turn serve as a lookup mechanism independent of the input data.

from hflayers import HopfieldLayer

hopfield_lookup = HopfieldLayer(input_size=...)
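
A corresponding sketch for the lookup variant, again under the same shape assumptions:

import torch
from hflayers import HopfieldLayer

hopfield_lookup = HopfieldLayer(input_size=32)

# hypothetical toy batch: 4 sets of 10 patterns, each of size 32
patterns = torch.randn(4, 10, 32)

# the input is matched against the trained stored patterns and pattern projections
result = hopfield_lookup(patterns)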

The usage is as <i>simple</i> as with the main module, but equally <i>powerful</i>.

Examples

Generally, the Hopfield layer is designed to implement or to substitute different layers like:

The folder examples contains multiple demonstrations of how to use the <code>Hopfield</code>, <code>HopfieldPooling</code> and <code>HopfieldLayer</code> modules. To successfully run the contained Jupyter notebooks, additional third-party modules like pandas and seaborn are required.

Disclaimer

Some implementations in this repository are based on existing ones from the official PyTorch repository v1.6.0 and were extended and modified accordingly. The parts involved are listed in the following:

License

This repository is BSD-style licensed (see LICENSE), except where noted otherwise.