
Efficient Softmax Approximation

Implementations of BlackOut and Adaptive Softmax, two methods for efficiently computing the output word distribution in language models with very large vocabularies.

LSTM language models are derived from rnnlm_chainer.

The following output layers are available:

Linear + softmax with cross-entropy loss: the standard output layer.
--share-embedding: A variant in which the output layer shares the word embedding matrix with the input layer.
--adaptive-softmax: Adaptive softmax
--blackout: BlackOut (not faster on GPU)
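
As a rough illustration of what sharing the embedding matrix means, here is a minimal NumPy sketch. It is not the repository's Chainer code: the names `embedding`, `embed`, and `output_log_probs` are hypothetical, and it assumes the hidden size equals the embedding size so that one matrix can serve both layers.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4  # toy sizes, chosen only for illustration

# One matrix is used both as the input embedding table and as the output projection.
embedding = rng.normal(size=(vocab_size, hidden_size))

def embed(word_ids):
    """Input layer: look up one embedding row per word id."""
    return embedding[word_ids]

def output_log_probs(hidden):
    """Output layer: score every word with the same matrix, then apply log-softmax."""
    logits = hidden @ embedding.T
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

hidden = embed(np.array([1, 3]))      # stand-in for the LSTM output
log_probs = output_log_probs(hidden)  # shape: (2, vocab_size)
```

Tying the two layers removes one vocabulary-sized weight matrix from the model, which matters most when the vocabulary is large.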

Adaptive Softmax

Efficient softmax approximation for GPUs
Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou, ICML 2017
paper
authors' Lua code
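
The core idea can be sketched with a two-cluster toy example in NumPy. This is a simplified illustration under stated assumptions, not the repository's implementation: the sizes, the single `cutoff`, and the names `W_head`, `W_tail`, and `adaptive_log_probs` are hypothetical, words are assumed to be sorted by frequency, and the reduced hidden dimension that the paper uses for tail clusters is omitted.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
vocab_size, hidden_size, cutoff = 12, 6, 4  # word ids 0..3 are the "frequent" words
tail_size = vocab_size - cutoff

# The head scores the frequent words plus one extra slot standing for the tail cluster.
W_head = rng.normal(size=(hidden_size, cutoff + 1))
# The tail scores only the rare words.
W_tail = rng.normal(size=(hidden_size, tail_size))

def adaptive_log_probs(hidden):
    """Return log-probabilities over the full vocabulary, shape (batch, vocab_size)."""
    head = log_softmax(hidden @ W_head)  # (batch, cutoff + 1)
    tail = log_softmax(hidden @ W_tail)  # (batch, tail_size)
    # log p(rare word) = log p(tail cluster) + log p(word | tail cluster)
    return np.concatenate([head[:, :cutoff], head[:, cutoff:] + tail], axis=1)

def nll(hidden, targets):
    """Mean negative log-likelihood of the target word ids."""
    full = adaptive_log_probs(hidden)
    return -full[np.arange(len(targets)), targets].mean()

hidden = rng.normal(size=(3, hidden_size))
loss = nll(hidden, np.array([0, 5, 11]))
```

For clarity the sketch always evaluates both clusters. The saving during training comes from evaluating the tail only for examples whose target is a rare word, so most examples need only the small head.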

BlackOut

BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, Pradeep Dubey, ICLR 2016
paper
authors' C++ code
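
The sampled objective can be sketched in NumPy as follows. This is a toy illustration of the loss as described in the paper, not the repository's code: the sizes, the value of `alpha`, the random unigram distribution, and the names `proposal` and `blackout_loss` are hypothetical, and negatives are drawn without replacement here purely for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_samples, alpha = 50, 8, 5, 0.4  # assumed toy values

# Proposal distribution: unigram frequencies raised to the power alpha.
unigram = rng.random(vocab_size) + 0.1
unigram /= unigram.sum()
proposal = unigram ** alpha
proposal /= proposal.sum()

W = rng.normal(scale=0.1, size=(vocab_size, hidden_size))  # output word vectors

def blackout_loss(hidden, target):
    """Discriminative loss over the target word and a few sampled negative words."""
    # Sample negatives from the proposal, excluding the target itself.
    q = proposal.copy()
    q[target] = 0.0
    q /= q.sum()
    negatives = rng.choice(vocab_size, size=num_samples, replace=False, p=q)

    ids = np.concatenate([[target], negatives])
    # Importance-weighted scores: subtracting log q is multiplying exp(score) by 1/q.
    scores = W[ids] @ hidden - np.log(proposal[ids])
    scores -= scores.max()  # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()

    # Maximize log p(target) and log(1 - p(negative)) for every sampled negative.
    return -(np.log(p[0]) + np.log(1.0 - p[1:]).sum())

loss = blackout_loss(rng.normal(size=hidden_size), target=7)
```

Each step touches only `num_samples + 1` rows of the output matrix instead of the full vocabulary, which is where the speedup on large vocabularies comes from.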

How to Run

python -u train.py -g 0

Datasets

PennTreeBank
Wikitext-2
Wikitext-103

For the Wikitext datasets, run prepare_wikitext.sh to download them.