<h1 align="center">AdaBelief Optimizer</h1> <h3 align="center">NeurIPS 2020 Spotlight, trains fast as Adam, generalizes well as SGD, and is stable to train GANs.</h3>

Release of package

We have released adabelief-pytorch==0.2.0 and adabelief-tf==0.2.0. Please use the latest version from pip. Source code is available under the folders pypi_packages/adabelief_pytorch0.2.0 and pypi_packages/adabelief_tf0.2.0.

External Links

<a href="https://juntang-zhuang.github.io/adabelief/"> Project Page</a>, <a href="https://arxiv.org/abs/2010.07468"> arXiv </a>, <a href="https://www.reddit.com/r/MachineLearning/comments/jc1fp2/r_neurips_2020_spotlight_adabelief_optimizer">Reddit </a>, <a href="https://twitter.com/JuntangZhuang/status/1316934184607354891">Twitter</a>, <a href="https://www.bilibili.com/video/BV1uy4y1q7RG">BiliBili (中文)</a>, <a href="https://www.bilibili.com/video/BV1vi4y1c71S">BiliBili (English)</a>, <a href="https://youtu.be/oGH7dmwvuaY">YouTube</a>

Link to code for extra experiments with AdaBelief

Update for adabelief-pytorch==0.2.0 (Crucial)

In the next release of adabelief-pytorch, we will modify the defaults of several arguments in order to fit the needs of general tasks such as GAN and Transformer training. Please check whether you specify these arguments or use the defaults when upgrading from version 0.0.5 to a higher version.

| Version | epsilon | weight_decouple | rectify |
|---|---|---|---|
| adabelief-pytorch=0.0.5 | 1e-8 | False | False |
| latest version 0.2.0 (>0.0.5) | 1e-16 | True | True |
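
For example, a minimal sketch (assuming adabelief-pytorch>=0.2.0 and a toy model, for illustration only) that pins the changed arguments explicitly instead of relying on version-dependent defaults:

```python
import torch.nn as nn
from adabelief_pytorch import AdaBelief

model = nn.Linear(10, 1)  # toy model; substitute your own network

# Pin the arguments whose defaults changed between 0.0.5 and 0.2.0
# (values taken from the table above) instead of relying on the defaults.
optimizer = AdaBelief(
    model.parameters(),
    lr=1e-3,
    eps=1e-16,             # 0.0.5 default was 1e-8
    betas=(0.9, 0.999),
    weight_decouple=True,  # 0.0.5 default was False
    rectify=True,          # 0.0.5 default was False
)
```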

Update for adabelief-tf==0.2.0 (Crucial)

In adabelief-tf==0.1.0, we modified adabelief-tf to have the same features as adabelief-pytorch, including decoupled weight decay and learning rate rectification. Furthermore, we added support for TensorFlow>=2.0 and Keras. The source code is in pypi_packages/adabelief_tf0.1.0. We tested it on a text classification task and a word embedding task. The default values are updated; please check whether you specify these arguments or use the defaults when upgrading from version 0.0.1 to a higher version:

| Version | epsilon | weight_decouple | rectify |
|---|---|---|---|
| adabelief-tf=0.0.1 | 1e-8 | Not supported | Not supported |
| latest version 0.2.0 (>0.0.1) | 1e-14 | Supported (not an option in arguments) | default: True |
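
Similarly, a minimal sketch (assuming adabelief-tf>=0.2.0, for illustration only) that pins the changed arguments explicitly:

```python
from adabelief_tf import AdaBeliefOptimizer

# Pin epsilon and rectify explicitly (0.0.1 used epsilon=1e-8 and had no rectification).
optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=True)
```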

Quick Guide

Table of Hyper-parameters

Please check that you have specified all arguments and that your version is the latest; the defaults might not be suitable for different tasks. See the tables below.

Hyper-parameters in PyTorch

| Task | lr | beta1 | beta2 | epsilon | weight_decay | weight_decouple | rectify | fixed_decay | amsgrad |
|---|---|---|---|---|---|---|---|---|---|
| Cifar | 1e-3 | 0.9 | 0.999 | 1e-8 | 5e-4 | False | False | False | False |
| ImageNet | 1e-3 | 0.9 | 0.999 | 1e-8 | 1e-2 | True | False | False | False |
| Object detection (PASCAL) | 1e-4 | 0.9 | 0.999 | 1e-8 | 1e-4 | False | False | False | False |
| LSTM-1layer | 1e-3 | 0.9 | 0.999 | 1e-16 | 1.2e-6 | False | False | False | False |
| LSTM 2,3 layer | 1e-2 | 0.9 | 0.999 | 1e-12 | 1.2e-6 | False | False | False | False |
| GAN (small) | 2e-4 | 0.5 | 0.999 | 1e-12 | 0 | True=False (decay=0) | False | False | False |
| SN-GAN (large) | 2e-4 | 0.5 | 0.999 | 1e-16 | 0 | True=False (decay=0) | True | False | False |
| Transformer | 5e-4 | 0.9 | 0.999 | 1e-16 | 1e-4 | True | True | False | False |
| Reinforcement (Rainbow) | 1e-4 | 0.9 | 0.999 | 1e-10 | 0.0 | True=False (decay=0) | True | False | False |
| Reinforcement (HalfCheetah-v2) | 1e-3 | 0.9 | 0.999 | 1e-12 | 0.0 | True=False (decay=0) | True | False | False |
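
As an illustration of how the rows above map to constructor calls, here is a hedged sketch for the ImageNet and SN-GAN (large) settings; the models are toy stand-ins for your own networks:

```python
import torch.nn as nn
from adabelief_pytorch import AdaBelief

model = nn.Linear(10, 1)      # stand-in for an ImageNet classifier
generator = nn.Linear(10, 1)  # stand-in for an SN-GAN generator

# ImageNet row: decoupled weight decay 1e-2, no rectification.
imagenet_opt = AdaBelief(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                         weight_decay=1e-2, weight_decouple=True, rectify=False,
                         fixed_decay=False, amsgrad=False)

# SN-GAN (large) row: no weight decay (so weight_decouple has no effect),
# rectification on, smaller eps.
sngan_opt = AdaBelief(generator.parameters(), lr=2e-4, betas=(0.5, 0.999), eps=1e-16,
                      weight_decay=0, weight_decouple=False, rectify=True,
                      fixed_decay=False, amsgrad=False)
```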

Hyper-parameters in Tensorflow (eps in Tensorflow might need to be larger than in PyTorch)

epsilon is used in a different way in Tensorflow (default 1e-7) compared to PyTorch (default 1e-8), so eps in Tensorflow might need to be larger than in PyTorch (perhaps 100 times larger, e.g. eps=1e-16 in PyTorch vs. eps=1e-14 in Tensorflow). But personally I don't have much experience with Tensorflow, so it's likely that you will need to slightly tune eps.

Installation and usage

1. PyTorch implementations

(Results in the paper are all generated using the PyTorch implementation in the adabelief-pytorch package, which is the ONLY package that I have extensively tested for now.) <br>

AdaBelief

Please install the latest version (0.2.0); the previous version (0.0.5) uses different default arguments.

pip install adabelief-pytorch==0.2.0
from adabelief_pytorch import AdaBelief
optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16, betas=(0.9,0.999), weight_decouple = True, rectify = False)
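
A minimal end-to-end sketch of using the optimizer in a standard PyTorch training loop (the toy model and random data are for illustration only):

```python
import torch
import torch.nn as nn
from adabelief_pytorch import AdaBelief

model = nn.Linear(10, 1)  # toy model
criterion = nn.MSELoss()
optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16,
                      betas=(0.9, 0.999), weight_decouple=True, rectify=False)

x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```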

Adabelief with Ranger optimizer

pip install ranger-adabelief==0.1.0
from ranger_adabelief import RangerAdaBelief
optimizer = RangerAdaBelief(model.parameters(), lr=1e-3, eps=1e-12, betas=(0.9,0.999))

2. Tensorflow implementation (eps of AdaBelief in Tensorflow is larger than in PyTorch, same for Adam)

pip install adabelief-tf==0.2.0
from adabelief_tf import AdaBeliefOptimizer
optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)
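
A minimal Keras sketch (assuming adabelief-tf>=0.2.0 and a TensorFlow 2.x version whose Keras optimizer API the package supports; the toy model and random data are for illustration only):

```python
import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer

# Toy model, for illustration only.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False),
              loss="mse")

x = tf.random.normal((32, 10))  # dummy batch
y = tf.random.normal((32, 1))
model.fit(x, y, epochs=2, verbose=0)
```
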
<h2>A quick look at the algorithm</h2> <p align='center'> <img src="imgs/adabelief-algo2.png" width="80%"> </p> <div> Adam and AdaBelief are summarized in Algo.1 and Algo.2, where all operations are element-wise, with differences marked in blue. Note that no extra parameters are introduced in AdaBelief. For simplicity, we omit the bias correction step. Specifically, in Adam, the update direction is <img src="https://render.githubusercontent.com/render/math?math=\frac{m_t}{\sqrt{v_t}}">, where <img src="https://render.githubusercontent.com/render/math?math=v_t"> is the EMA (Exponential Moving Average) of <img src="https://render.githubusercontent.com/render/math?math=g_t^2">; in AdaBelief, the update direction is <img src="https://render.githubusercontent.com/render/math?math=\frac{m_t}{\sqrt{s_t}}">, where <img src="https://render.githubusercontent.com/render/math?math=s_t"> is the EMA of <img src="https://render.githubusercontent.com/render/math?math=(g_t-m_t)^2">. Intuitively, viewing <img src="https://render.githubusercontent.com/render/math?math=m_t"> as the prediction of <img src="https://render.githubusercontent.com/render/math?math=g_t">, AdaBelief takes a large step when observation <img src="https://render.githubusercontent.com/render/math?math=g_t"> is close to prediction <img src="https://render.githubusercontent.com/render/math?math=m_t">, and a small step when the observation greatly deviates from the prediction. </div>
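
For concreteness, here is a minimal NumPy sketch of a single Adam step versus a single AdaBelief step, without bias correction (as in the algorithm boxes above); this is an illustration, not the packaged implementation:

```python
import numpy as np

def adam_step(theta, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: v is the EMA of g^2.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v

def adabelief_step(theta, g, m, s, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-16):
    # AdaBelief: s is the EMA of (g - m)^2, i.e. how much the observed gradient
    # deviates from its prediction m. Small deviation -> large step, and vice versa.
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m)**2 + eps
    theta = theta - lr * m / (np.sqrt(s) + eps)
    return theta, m, s
```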

Reproduce results in the paper

(Comparison with 8 other optimizers: SGD, Adam, AdaBound, RAdam, AdamW, Yogi, MSVAG, Fromage)

See the folder PyTorch_Experiments; for each subfolder, execute sh run.sh. See readme.txt in each subfolder for visualization, or refer to the Jupyter notebooks for visualization.

Results on Image Recognition

<p align="center"> <img src="./imgs/image_recog.png" width="80%"/> </p>

Results on GAN training

Results on a small GAN with vanilla CNN generator

<p align="center"> <img src="./imgs/GAN.png" width="80%"/> </p>

Results on Spectral Normalization GAN with a ResNet generator

<p align="center"> <img src="./imgs/sn-gan.png" width="80%"/> </p>

Results on LSTM

<p align="center"> <img src="./imgs/lstm.png" width="80%"/> </p>

Results on Transformer

<p align="center"> <img src="./imgs/transformer.png" width="60%"/> </p>

Results on Toy Example

<p align="center"> <img src="./imgs/Beale2.gif" width="80%"/> </p>

Discussions

Installation

Please install the latest version from pip; old versions might suffer from bugs. Source code for the up-to-date packages is available in the folder pypi_packages.

Discussion on hyper-parameters

AdaBelief uses a different denominator from Adam, and is orthogonal to other techniques such as rectification, decoupled weight decay, weight averaging, etc. This implies that if you use certain techniques with Adam, you might still need those techniques with AdaBelief to get a good result.
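
For instance, a hedged sketch of combining AdaBelief with stochastic weight averaging via torch.optim.swa_utils; model and train_one_epoch are placeholders for your own network and training loop:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR
from adabelief_pytorch import AdaBelief

optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-8,
                      betas=(0.9, 0.999), weight_decouple=True, rectify=False)
swa_model = AveragedModel(model)               # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=5e-4)  # constant learning rate during SWA

for epoch in range(90):
    train_one_epoch(model, optimizer)          # placeholder for your training loop
    if epoch >= 75:                            # start averaging late in training
        swa_model.update_parameters(model)
        swa_scheduler.step()
```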

Discussion on algorithms

1. Weight Decay:
2. Epsilon:

AdaBelief seems to require a different epsilon from Adam. In the CV tasks in this paper, epsilon is set to 1e-8; for GAN training it is set to 1e-16. We recommend trying different epsilon values in practice and sweeping through a large region. We recommend eps=1e-8 when SGD outperforms Adam, as in many CV tasks, and eps=1e-16 when Adam outperforms SGD, as in GAN and Transformer training. Sometimes you might need to try eps=1e-12, for example in some reinforcement learning tasks.
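
A hedged sketch of such a sweep; build_model and train_and_evaluate are hypothetical placeholders for your own pipeline:

```python
from adabelief_pytorch import AdaBelief

# Hypothetical epsilon sweep; build_model() and train_and_evaluate() are placeholders.
results = {}
for eps in (1e-8, 1e-12, 1e-16):
    model = build_model()
    optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=eps,
                          betas=(0.9, 0.999), weight_decouple=True, rectify=False)
    results[eps] = train_and_evaluate(model, optimizer)

best_eps = max(results, key=results.get)  # assumes a higher metric is better
```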

3. Rectify (argument rectify in AdaBelief):

Whether to turn on the rectification as in RAdam. The rectification basically uses SGD in the early phase for warmup, then switches to Adam. Rectification is implemented as an option, but was never used to produce results in the paper.

4. AMSgrad (argument amsgrad (default: False) in AdaBelief):

Whether to take the maximum (over history) of the denominator, as in AMSGrad. It is set to False for all experiments.

5. Details to reproduce results
6. Learning rate schedule

The experiments on Cifar are the same as the demo in AdaBound, with the only difference being the optimizer. The ImageNet experiment uses a different learning rate schedule, typically decayed by 1/10 at epochs 30 and 60 and ending at epoch 90. For reasons I have not extensively experimented with, AdaBelief performs well when decayed at epochs 70 and 80 and ending at epoch 90; using the default lr schedule produces a slightly worse result. If you have any ideas on this, please open an issue here or email me.
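
For example, a hedged sketch of the milestone schedule described above (decay by 1/10 at epochs 70 and 80, stop at 90) using PyTorch's built-in MultiStepLR; model and train_one_epoch are placeholders:

```python
import torch
from adabelief_pytorch import AdaBelief

optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-8,
                      betas=(0.9, 0.999), weight_decouple=True, rectify=False)
# Decay the learning rate by 1/10 at epochs 70 and 80, training for 90 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70, 80], gamma=0.1)

for epoch in range(90):
    train_one_epoch(model, optimizer)  # placeholder for your training loop
    scheduler.step()
```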

7. Some experience with RNN

I got some feedback on RNNs from the Reddit discussion; here are a few tips:

8. Contact

Please contact me at j.zhuang@yale.edu or open an issue here if you would like to help improve it (especially the TensorFlow version), discuss the theory, or explore combinations with other methods to create a better optimizer. Any thoughts are welcome!

Update Plan

To do

Done

Citation

@article{zhuang2020adabelief,
  title={AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients},
  author={Zhuang, Juntang and Tang, Tommy and Ding, Yifan and Tatikonda, Sekhar and Dvornek, Nicha and Papademetris, Xenophon and Duncan, James},
  journal={Conference on Neural Information Processing Systems},
  year={2020}
}
@article{zhuang2021acprop,
  title={Momentum Centering and Asynchronous Update for Adaptive Gradient Methods},
  author={Zhuang, Juntang and Ding, Yifan and Tang, Tommy and Dvornek, Nicha and Tatikonda, Sekhar and Duncan, James},
  journal={Conference on Neural Information Processing Systems},
  year={2021}
}