tf-adaptive-softmax-lstm-lm

This repository shows experimental results for LSTM language models on PTB (Penn Treebank) and GBW (Google One Billion Word) using AdaptiveSoftmax on TensorFlow.

Adaptive Softmax

The adaptive softmax is a faster way to train a softmax classifier over a very large number of classes, and it can be used for both training and prediction. For example, it can be used to train a language model with a very large vocabulary, and the trained language model can then be applied efficiently in speech recognition, text generation, and machine translation.

The adaptive softmax has been used in the ASR system developed by Tencent AI Lab, where it achieved about a 20x speedup over the full softmax in the second-pass rescoring.

See Efficient softmax approximation for GPUs [1] for details about the adaptive softmax algorithm.
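
For intuition, here is a minimal NumPy sketch of the two-level idea behind the adaptive softmax on a toy 10-word vocabulary. This is purely illustrative and is not the TensorFlow op used in this repo: frequent words are scored by a small head classifier, and rare words are reached through an extra "tail cluster" token, so probabilities over the full vocabulary still sum to one.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# Toy setup: 10-word vocabulary with cutoffs [4, 10] -> a head of the 4 most
# frequent words plus one tail cluster holding words 4..9.
vocab_size, cutoffs = 10, [4, 10]
hidden_dim, tail_dim = 8, 4              # the tail uses a smaller projection
rng = np.random.default_rng(0)

W_head = rng.normal(size=(hidden_dim, cutoffs[0] + 1))   # 4 head words + 1 "tail" token
P_tail = rng.normal(size=(hidden_dim, tail_dim))         # down-projection for the tail
W_tail = rng.normal(size=(tail_dim, cutoffs[1] - cutoffs[0]))

def adaptive_log_prob(h, word_id):
    """log P(word_id | h) under a two-level adaptive softmax."""
    head_logp = log_softmax(h @ W_head)
    if word_id < cutoffs[0]:                 # frequent word: head softmax only
        return head_logp[word_id]
    tail_logp = log_softmax((h @ P_tail) @ W_tail)
    # P(word) = P(tail cluster | h) * P(word | tail cluster, h)
    return head_logp[cutoffs[0]] + tail_logp[word_id - cutoffs[0]]

h = rng.normal(size=hidden_dim)
total = sum(np.exp(adaptive_log_prob(h, w)) for w in range(vocab_size))
print(round(float(total), 6))                # ~1.0: a proper distribution over the vocab
```

During training, a batch only needs the head plus the tail clusters that actually contain its target words, and the tail clusters use reduced projection dimensions, which is where the speedup over the full softmax comes from [1].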

Implementation

The implementation of AdaptiveSoftmax on TensorFlow can be found here: TencentAILab/tensorflow

Usage

Train with AdaptiveSoftmax:

python train_lm.py --data_path=ptb_data --gpuid=0 --use_adaptive_softmax=1

Train with full softmax:

python train_lm.py --data_path=ptb_data --gpuid=0 --use_adaptive_softmax=0
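
As a rough sketch, the three flags used above could be parsed as below; the actual train_lm.py may define them differently (for example, with TensorFlow's flag module), so treat this only as an illustration of the interface.

```python
# Illustrative only: a minimal argparse wrapper mirroring the flags in the
# commands above; not necessarily how train_lm.py implements them.
import argparse

parser = argparse.ArgumentParser(description="LSTM LM with (adaptive) softmax")
parser.add_argument("--data_path", type=str, required=True,
                    help="directory holding the PTB/GBW text files")
parser.add_argument("--gpuid", type=int, default=0,
                    help="index of the GPU to train on")
parser.add_argument("--use_adaptive_softmax", type=int, choices=[0, 1], default=1,
                    help="1 = AdaptiveSoftmax output layer, 0 = full softmax")
args = parser.parse_args()
print(args.data_path, args.gpuid, args.use_adaptive_softmax)
```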

Experiment results

Language Modeling on PTB

With the hyperparameters below, training 20 epochs on the PTB corpus takes 5min54s, and the final perplexity on the test set is 88.51. With the same parameters but the full softmax, training 20 epochs takes 6min57s, and the final perplexity on the test set is 89.00.

Since the PTB vocabulary contains only about 10K words, the speedup is not that significant.

hyperparameters:

epoch_num = 20
train_batch_size = 128
train_step_size = 20
valid_batch_size = 128
valid_step_size = 20
test_batch_size = 20
test_step_size = 1
word_embedding_dim = 512
lstm_layers = 1
lstm_size = 512
lstm_forget_bias = 0.0
max_grad_norm = 0.25
init_scale = 0.05
learning_rate = 0.2
decay = 0.5
decay_when = 1.0
dropout_prob = 0.5
adagrad_eps = 1e-5
vocab_size = 10001
softmax_type = "AdaptiveSoftmax"
adaptive_softmax_cutoff = [2000, vocab_size]
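
The cutoff list determines how the vocabulary is partitioned, assuming words are sorted by descending frequency as in the reference implementation [2]: the first entry is the size of the head, and each subsequent entry closes one tail cluster. A small illustrative helper (not part of this repo):

```python
# Illustrative helper: cluster sizes implied by an adaptive-softmax cutoff list,
# assuming the vocabulary is sorted by descending word frequency.
def cluster_sizes(cutoff):
    return [cutoff[0]] + [b - a for a, b in zip(cutoff, cutoff[1:])]

vocab_size = 10001
print(cluster_sizes([2000, vocab_size]))  # [2000, 8001]: head + one tail cluster
```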

result:

| Epoch | Elapse  | Train PPL | Valid PPL | Test PPL |
|-------|---------|-----------|-----------|----------|
| 1     | 0min18s | 376.407   | 169.152   | 164.039  |
| 2     | 0min35s | 154.324   | 132.648   | 127.494  |
| 3     | 0min53s | 117.210   | 118.547   | 113.197  |
| 4     | 1min11s | 98.662    | 111.791   | 106.373  |
| 5     | 1min28s | 87.366    | 107.808   | 102.588  |
| 6     | 1min46s | 79.448    | 105.028   | 100.024  |
| 7     | 2min04s | 73.749    | 103.705   | 98.220   |
| 8     | 2min21s | 69.392    | 102.939   | 96.931   |
| 9     | 2min39s | 62.737    | 100.174   | 94.043   |
| 10    | 2min57s | 59.423    | 99.412    | 93.153   |
| 11    | 3min15s | 56.634    | 97.600    | 91.271   |
| 12    | 3min32s | 55.036    | 97.388    | 91.061   |
| 13    | 3min50s | 54.002    | 96.127    | 89.796   |
| 14    | 4min08s | 53.232    | 96.170    | 89.805   |
| 15    | 4min25s | 52.844    | 95.461    | 89.130   |
| 16    | 4min43s | 52.488    | 95.085    | 88.788   |
| 17    | 5min01s | 52.314    | 94.905    | 88.615   |
| 18    | 5min18s | 52.172    | 94.835    | 88.553   |
| 19    | 5min36s | 52.038    | 94.806    | 88.526   |
| 20    | 5min54s | 51.998    | 94.788    | 88.510   |

Language Modeling on the Google One Billion Word corpus

hyperparameters:

word_embedding_dim = 256
train_batch_size = 256
train_step_size = 20
valid_batch_size = 256
valid_step_size = 20
test_batch_size = 128
test_step_size = 1
lstm_layers = 1
lstm_size = 2048
lstm_forget_bias = 1.0
max_grad_norm = 0.25
init_scale = 0.05
learning_rate = 0.1
decay = 0.5
decay_when = 1.0
dropout_prob = 0.01
adagrad_eps = 1e-5
vocab_size = 793471
softmax_type = "AdaptiveSoftmax"
adaptive_softmax_cutoff = [4000, 40000, 200000, vocab_size]
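
To see why an approximation matters at this scale: a full softmax output layer needs an lstm_size x vocab_size weight matrix, roughly 1.6B weights, whereas the adaptive softmax scores most tokens against only the 4,000-word head plus three cluster tokens, with the tail clusters evaluated rarely and on reduced-dimension projections (the exact savings depend on the projection sizes chosen by the implementation). A quick back-of-the-envelope check:

```python
# Back-of-the-envelope: output-layer sizes implied by the GBW configuration above.
lstm_size, vocab_size = 2048, 793471
full_softmax_params = lstm_size * vocab_size
print(f"{full_softmax_params:,}")        # 1,625,028,608 weights before any biases

# With adaptive_softmax_cutoff = [4000, 40000, 200000, vocab_size], the head
# softmax scores only 4000 words + 3 cluster tokens on most steps; tail clusters
# are evaluated rarely and, per the paper [1], on reduced-dimension projections.
head_outputs = 4000 + 3
print(f"{lstm_size * head_outputs:,}")   # 8,198,144 weights touched on most steps
```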

result:

On the GBW corpus, we achieved a perplexity of 43.24 after 5 epochs, taking about two days to train on 2 GPUs with synchronous gradient updates.

| Epoch | Elapse   | Train PPL | Valid PPL | Test PPL |
|-------|----------|-----------|-----------|----------|
| 1     | 9h56min  | 51.428    | 52.727    | 49.553   |
| 2     | 19h53min | 45.141    | 48.683    | 45.639   |
| 3     | 29h51min | 42.605    | 47.379    | 44.332   |
| 4     | 39h48min | 41.119    | 46.822    | 43.743   |
| 5     | 49h45min | 38.757    | 46.402    | 43.241   |
| 6     | 59h42min | 37.664    | 46.334    | 43.119   |
| 7     | 69h40min | 37.139    | 46.337    | 43.101   |
| 8     | 79h37min | 36.884    | 46.342    | 43.097   |

Reference

[1] Grave E., Joulin A., Cissé M., et al. Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309, 2016.

[2] https://github.com/facebookresearch/adaptive-softmax