Home

Awesome

layer-norm

Code and models from the paper "Layer Normalization".

Dependencies

To use the code you will need:

Along with the Theano version described below, we also include a torch implementation in the torch_modules directory.

Setup

Available is a file layers.py which contain functions for layer normalization (LN) and 4 RNN layers: GRU, LSTM, GRU+LN and LSTM+LN. The GRU and LSTM functions are added to show what differs from the functions that use LN.

Below we describe how to integrate these functions into existing Github respositories that will allow you to perform the same experients as in the paper. We also make available the trained models that we used to compute curves and numbers in the paper.

Ideally in the future these will be made as PRs to the corresponding repositories.

NOTE: it is highly encouraged to use CNMeM when using layer norm. Just add cnmem = 1 to your Theano flags.

Order-embeddings

The order-embeddings experiments make use of the respository from Ivan Vendrov et al available here. To train order-embeddings with layer normalization:

Available below is a download to the model used to report results in the paper:

wget http://www.cs.toronto.edu/~rkiros/lngru_order_add0.npz
wget http://www.cs.toronto.edu/~rkiros/lngru_order_add0.pkl

Once downloaded, follow the instructions on the main page for evaluating models. This will allow you to reproduce the numbers reported in the table for order-embeddings+LN.

Skip-thoughts

The skip-thoughts experiments make use of the repository from Jamie Ryan Kiros et al available here. To train skip-thoughts with layer normalization:

Below is the skip-thoughts model trained for 1 month using layer normalization:

wget http://www.cs.toronto.edu/~rkiros/lngru_may13_1700000.npz
wget http://www.cs.toronto.edu/~rkiros/lngru_may13_1700000.npz.pkl

Once downloaded, follow Step 4 in the training directory to load the model. This model will allow you to reproduce the reported results in the last row of the table. Step 5 describes how to use the model to encode new sentences into vectors.

Attentive-reader

The attentive reader experiment makes use of the repository from Tim Cooijmans et al here. To train an attentive reader model:

Clone the above repository and obtain the data:

wget http://www.cs.toronto.edu/~rkiros/top4.zip

Here is the command I used for training. See train_baseline.sh for context. Some of the settings in the repository have changed since our experiment:

THEANO_FLAGS=device=gpu0,lib.cnmem=0.85 python -u -O ${SDIR}/train_attentive_reader.py --use_dq_sims 1 --use_desc_skip_c_g 0 --dim 240 --learn_h0 1 --lr 8e-5 --truncate -1 --model "lstm_s1.npz" --batch_size 64 --optimizer "adam" --validFreq 1000 --model_dir $MDIR --use_desc_skip_c_g 1  --unit_type lnlstm

Below are the log files from the model trained using layer normalization:

wget http://www.cs.toronto.edu/~rkiros/stats_dimworda_[240]_datamode_top4_usedqsim_1_useelug_0_validFre_1000_clip-c_[10.0]_usebidir_0_encoderq_lnlstm_dimproj_[240]_use-drop_[True]_optimize_adam_lstm_s1.npz.pkl

These can be used to reproduce the layer norm curve from the paper.