On the Diminishing Returns of Width for Continual Learning
This repository contains the code to reproduce the empirical results in the paper On the Diminishing Returns of Width for Continual Learning.
Summary of Results
Abstract
While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from catastrophic forgetting when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network leads to a decrease in catastrophic forgetting, but have yet to characterize the exact relationship between width and continual learning. In this paper, we design one of the first frameworks to analyze continual learning theory and prove that width is directly related to forgetting in Feed-Forward Networks (FFN). In particular, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns. We empirically verify our claims at widths hitherto unexplored in prior studies, where the diminishing returns are clearly observed as predicted by our theory.
Theoretical Contributions
Our results contribute to the literature examining the relationship between neural network architectures and continual learning performance. We provide one of the first theoretical frameworks for analyzing catastrophic forgetting in Feed-Forward Networks. While the framework does not capture every empirical aspect of forgetting, it is a valuable step toward analyzing continual learning theoretically. As predicted by our theory, we demonstrate empirically that scaling width alone is insufficient to mitigate catastrophic forgetting, providing a more nuanced understanding of finite-width forgetting dynamics than prior studies.
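To make the notion of diminishing returns concrete, the short derivation below works through a purely illustrative power-law forgetting curve; the functional form and the constants c and α are placeholders for exposition, not the bound proved in the paper.

```latex
% Purely illustrative example (not the bound from the paper):
% suppose forgetting decays as a power law in the width w,
\[
  F(w) = c\,w^{-\alpha}, \qquad c, \alpha > 0.
\]
% Then the benefit of doubling the width,
\[
  F(w) - F(2w) = c\,w^{-\alpha}\left(1 - 2^{-\alpha}\right),
\]
% itself decays like w^{-\alpha}: every successive doubling buys a smaller
% reduction in forgetting, which is exactly the diminishing-returns behavior.
```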
Empirical Validation
Rotated MNIST (1 Layer MLP)
Width | Average Accuracy (AA) | Average Forgetting (AF) | Learning Accuracy (LA) | Joint Accuracy (JA) |
---|---|---|---|---|
32 | 56.3 | 37.7 | 93.0 | 91.8 |
64 | 58.7 | 36.0 | 93.5 | 93.5 |
128 | 59.8 | 35.0 | 93.8 | 94.3 |
256 | 60.9 | 34.2 | 94.0 | 94.8 |
512 | 61.9 | 33.2 | 94.1 | 95.0 |
1024 | 62.7 | 32.6 | 94.2 | 95.3 |
2048 | 64.1 | 31.2 | 94.3 | 95.5 |
4096 | 65.3 | 30.2 | 94.5 | 95.7 |
8192 | 66.7 | 28.9 | 94.7 | 95.7 |
16384 | 68.0 | 27.9 | 94.9 | 95.9 |
32768 | 69.4 | 26.6 | 95.6 | 96.1 |
65536 | 69.6 | 26.7 | 95.6 | 96.2 |
Rotated Fashion MNIST (1 Layer MLP)
Width | Average Accuracy (AA) | Average Forgetting (AF) | Learning Accuracy (LA) | Joint Accuracy (JA) |
---|---|---|---|---|
32 | 37.7 | 46.0 | 82.1 | 77.8 |
64 | 37.9 | 46.0 | 82.4 | 80.0 |
128 | 38.2 | 46.0 | 82.5 | 79.4 |
256 | 38.4 | 45.9 | 82.7 | 79.8 |
512 | 38.8 | 45.6 | 82.9 | 79.9 |
1024 | 39.3 | 45.3 | 83.1 | 79.9 |
2048 | 39.9 | 44.8 | 83.3 | 79.1 |
4096 | 40.1 | 44.9 | 83.7 | 80.9 |
8192 | 40.8 | 44.5 | 83.9 | 80.2 |
16384 | 41.4 | 44.3 | 84.5 | 78.8 |
32768 | 41.9 | 44.3 | 84.9 | 79.9 |
65536 | 42.0 | 44.6 | 85.5 | 80.9 |
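As a quick sanity check on this trend, the standalone Python sketch below (not part of the repository's code) takes the width and Average Forgetting (AF) columns from the Rotated MNIST table above and prints how much forgetting improves with each doubling of width.

```python
# Standalone sketch (not part of run.py): quantify the diminishing returns visible in the
# Rotated MNIST table by printing the change in Average Forgetting (AF) per width doubling.
widths = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]
af = [37.7, 36.0, 35.0, 34.2, 33.2, 32.6, 31.2, 30.2, 28.9, 27.9, 26.6, 26.7]

for i in range(1, len(widths)):
    drop = af[i - 1] - af[i]  # positive = forgetting went down at the larger width
    print(f"width {widths[i - 1]:>6} -> {widths[i]:>6}: AF improves by {drop:+.1f} points")

# Even though each step doubles the width, no single doubling buys more than ~2 points of AF,
# and the gain disappears at the largest width -- the diminishing returns predicted by the theory.
```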
Getting Started
To begin, clone the repository. The only external dependencies needed to run the code are PyTorch and torchvision, which can be installed by following the instructions on the PyTorch website.
Running Experiments
To reproduce any of the experiments from the paper, run the `run.py` script, specifying the task name, the number of layers, and the layer width. For example:
```bash
python3 run.py --task_name mnist --num_layers 2 --width 1024
python3 run.py --task_name svhn --num_layers 1 --width 64
python3 run.py --task_name gtsrb --num_layers 3 --width 256
```
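To sweep all of the widths reported in the tables above, a small driver script along the following lines could be used. It is a sketch, not a script shipped with the repository: the flag names are taken from the examples above, while the choice of task and depth is only an illustration to adapt as needed.

```python
# Hypothetical driver (not included in this repository): call run.py once per width to
# reproduce a width sweep, reusing the command-line flags shown in the examples above.
import subprocess

WIDTHS = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]

for width in WIDTHS:
    cmd = [
        "python3", "run.py",
        "--task_name", "mnist",  # swap in svhn, gtsrb, etc. for other experiments
        "--num_layers", "1",     # 1-layer MLP, matching the tables above
        "--width", str(width),
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the sweep if any individual run fails
```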
Citation
```bibtex
@article{guha2024diminishing,
  title={On the Diminishing Returns of Width for Continual Learning},
  author={Guha, Etash and Lakshman, Vihan},
  journal={arXiv preprint arXiv:2403.06398},
  year={2024}
}
```