RMSNorm

short for Root Mean Square Layer Normalization

RMSNorm is a simplification of the original layer normalization (LayerNorm). LayerNorm is a regularization technique that may mitigate the internal covariate shift issue, stabilizing layer activations and improving model convergence. It has proven quite successful in NLP models. In some cases, LayerNorm has become an essential component for enabling model optimization, as in the SOTA NMT model Transformer.

One application of LayerNorm is in recurrent neural networks. However, we observe that LayerNorm adds computational overhead at each running step, which diminishes the net efficiency gain from faster and more stable training, as shown in the figure below.

<p align="center"> <img src="./rnn_layernorm.svg" height="300"> <em>Training procedure of a GRU-based RNNSearch for the first 10k training steps. Baseline means the original model without any normalization. When the Baseline training loss arrives at 7.0, the loss of LayerNorm reaches 5.4 after the same number of training steps (left figure), but only 5.9 after the same training time (right figure).</em> </p>

RMSNorm simplifies LayerNorm by removing the mean-centering operation, that is, by normalizing layer activations with the RMS statistic alone:

$$ \begin{align} \begin{split} & \bar{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})} g_i, \quad \text{where}~~ \text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2}. \end{split}\nonumber \end{align} $$
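The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function name `rms_norm` and the small constant `eps` (added to avoid division by zero) are assumptions of this sketch and do not appear in the formula.

```python
import numpy as np

def rms_norm(a, g, eps=1e-8):
    """Divide `a` by its root mean square over the last axis, then scale by gain `g`.

    `eps` is a hypothetical stabilizer; the formula in the text has no such term.
    """
    rms = np.sqrt(np.mean(a ** 2, axis=-1, keepdims=True))
    return a / (rms + eps) * g

a = np.array([[1.0, -2.0, 3.0, -4.0]])  # one example with n = 4 activations
g = np.ones(4)                          # gain initialized to 1
out = rms_norm(a, g)

# With unit gain, the normalized activations have an RMS of (almost exactly) 1.
out_rms = np.sqrt(np.mean(out ** 2))
```

No mean is computed or subtracted anywhere, which is the only difference from a bias-free LayerNorm.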

When the mean of the inputs is exactly 0, LayerNorm is equal to RMSNorm. We also observe that the RMS statistic can be estimated from a subset of the inputs, based on the iid assumption. The table below compares LayerNorm and RMSNorm in terms of different properties.

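| | Weight matrix re-scaling | Weight matrix re-centering | Weight vector re-scaling | Dataset re-scaling | Dataset re-centering | Single training case re-scaling |
|---|---|---|---|---|---|---|
| BatchNorm | | | | | | |
| WeightNorm | | | | | | |
| LayerNorm | | | | | | |
| RMSNorm | | | | | | |
| pRMSNorm | | | | | | |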
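The partial estimate of the RMS statistic (pRMSNorm) can be sketched as follows. This is an illustrative sketch under stated assumptions: the name `p_rms_norm`, the parameter `p`, the `eps` constant, and the choice of taking the first `k` entries as the subset are all assumptions here, not details confirmed by this README.

```python
import numpy as np

def p_rms_norm(a, g, p=0.25, eps=1e-8):
    """Normalize `a` by an RMS estimated from only a fraction `p` of its entries.

    Assumption: the subset is the first k = ceil(n * p) entries of the last axis.
    """
    n = a.shape[-1]
    k = max(1, int(np.ceil(n * p)))
    partial = a[..., :k]
    rms = np.sqrt(np.mean(partial ** 2, axis=-1, keepdims=True))
    return a / (rms + eps) * g

rng = np.random.default_rng(0)
a = rng.normal(size=(2, 4096))   # iid inputs, as the estimate assumes
g = np.ones(4096)

full_rms = np.sqrt(np.mean(a ** 2, axis=-1))
part_rms = np.sqrt(np.mean(a[:, :1024] ** 2, axis=-1))  # 25% of the inputs

out_full = p_rms_norm(a, g, p=1.0)   # p = 1 recovers plain RMSNorm
out_part = p_rms_norm(a, g, p=0.25)
```

For iid inputs, `part_rms` should land close to `full_rms`, which is what makes the cheaper estimate usable.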

As RMSNorm does not consider the mean of the inputs, it's not re-centering invariant. This is the main difference compared to LayerNorm.
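A small numerical check makes this difference concrete. The sketch below is illustrative only: it uses bias-free, unit-gain versions of both normalizations, the helper names and `eps` are assumptions, and the shift is applied directly to the inputs of a single example.

```python
import numpy as np

def rms_norm(a, eps=1e-8):
    """RMSNorm with unit gain (hypothetical helper for this check)."""
    return a / (np.sqrt(np.mean(a ** 2)) + eps)

def layer_norm(a, eps=1e-8):
    """LayerNorm with unit gain and no bias (hypothetical helper for this check)."""
    centered = a - np.mean(a)
    return centered / (np.sqrt(np.mean(centered ** 2)) + eps)

a = np.array([0.5, -1.0, 2.0, 3.5])

# Re-scaling the inputs leaves both outputs unchanged.
scale_ok_rms = np.allclose(rms_norm(3.0 * a), rms_norm(a))
scale_ok_ln = np.allclose(layer_norm(3.0 * a), layer_norm(a))

# Re-centering (adding a constant) leaves LayerNorm unchanged but changes RMSNorm.
shift_ok_ln = np.allclose(layer_norm(a + 10.0), layer_norm(a))
shift_ok_rms = np.allclose(rms_norm(a + 10.0), rms_norm(a))

# On zero-mean inputs the two normalizations coincide.
zero_mean = a - np.mean(a)
same_on_zero_mean = np.allclose(rms_norm(zero_mean), layer_norm(zero_mean))
```

Here `shift_ok_rms` is the only check that comes out false, matching the statement that RMSNorm gives up re-centering invariance while keeping re-scaling invariance.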

But does abandoning the re-centering invariance property matter? Or does this property help improve the robustness of LayerNorm? We ran an experiment on RNNSearch with Nematus, in which we initialized the weights with a center of about 0.2. The figure below suggests that removing the re-centering operation in RMSNorm does not hurt its stability.

<p align="center"> <img src="./ininmt.svg" height="300" width="100%"> <em>SacreBLEU score curve of LayerNorm and RMSNorm on newstest2013 (devset) when the initialization center is 0.2.</em> </p>

Requirements

The code relies on the following packages:

General codes

We provide separate code:

Experiments

We ran experiments on four different tasks, covering different neural models (RNN/CNN/Transformer), different non-linear activations (linear/sigmoid/tanh/relu), different weight initializations (normal/uniform/orthogonal), and different deep learning frameworks (Theano/PyTorch/TensorFlow). Our experiments involve both NLP-related and image-related tasks. Most of the settings follow those in the LayerNorm paper, but we put more focus on machine translation.

Machine Translation

The machine translation experiments are based on Nematus (v0.3). To run experiments with RMSNorm:

To ease the training of RNNSearch, we also provide the preprocessed dataset we used, the training script, and the pretrained model.

b here is deletable. Training proceeds in a similar way as above.

CNN/Daily Mail Reading Comprehension

We experiment with the bidirectional attentive reader model proposed by Hermann et al. We use the attentive reader implementation from the repository provided by Tim Cooijmans et al.

Please follow the steps below:

We train the model using the following command:

GPUARRAY_FORCE_CUDA_DRIVER_LOAD=True THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,gpuarray.preallocate=0.8 python -u train_attentive_reader.py \
    --use_dq_sims 1 --use_desc_skip_c_g 0 --dim 240 --learn_h0 1 --lr 8e-5 --truncate -1 --model "lstm_s1.npz" --batch_size 64 --optimizer "adam" --validFreq 1000 --model_dir $MDIR --use_desc_skip_c_g 1  --unit_type rlnlstm --use_bidir 1

Below are the log files from the model trained using RMSNorm:

wget http://data.statmt.org/bzhang/neurips19_rmsnorm/attentive_reader/stats_lstm_s1.npz.pkl

Image-Caption Retrieval

We experiment with the order-embedding model proposed by Vendrov et al. The code used is available here.

Please follow the steps below:

The model used to report results in the paper can be downloaded below:

wget http://data.statmt.org/bzhang/neurips19_rmsnorm/oe/order.npz
wget http://data.statmt.org/bzhang/neurips19_rmsnorm/oe/order.pkl

Once downloaded, follow the instructions on the main page for evaluating models. Note that you need to change the prefix for the rlngru model to lngru to use the saved models.

CIFAR-10 Classification

We experiment with the ConvPool-CNN-C architecture proposed by Krizhevsky and Hinton, and follow the settings in WeightNorm. We use the implementation here.

Please follow the steps below:

We use the following command to train the model:

GPUARRAY_FORCE_CUDA_DRIVER_LOAD=True THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,gpuarray.preallocate=0.4 python train.py --norm_type rms_norm --learning_rate 0.003

The running logs of our model can be downloaded as follows:

wget http://data.statmt.org/bzhang/neurips19_rmsnorm/cifar/results.csv

Citation

If you find the code useful, please consider citing the following paper:

Biao Zhang; Rico Sennrich (2019). Root Mean Square Layer Normalization. In Advances in Neural Information Processing Systems 32. Vancouver, Canada.

@inproceedings{zhang-sennrich-neurips19,
    address = "Vancouver, Canada",
    author = "Zhang, Biao and Sennrich, Rico",
    booktitle = "Advances in Neural Information Processing Systems 32",
    url = "https://openreview.net/references/pdf?id=S1qBAf6rr",
    title = "{Root Mean Square Layer Normalization}",
    year = "2019"
}

Please feel free to contact me for any questions about our paper.