PyTorch Large-Scale Language Model

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset

Latest Results

Previous Results

GPU Hardware Requirement

| Type | LM Memory Size | GPU |
|---|---|---|
| w/o tied weights | ~9 GB | Nvidia 1080 Ti, Nvidia Titan X |
| w/ tied weights [6] | ~7 GB | Nvidia 1070 or higher |
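
To see roughly where the ~2 GB gap between the two rows comes from, here is a back-of-the-envelope estimate. The ~800k-word GBW vocabulary and fp32 storage are assumptions, not values stated on this page:

```python
# Rough estimate of the memory saved by tying the embedding and softmax
# weights [6]. Vocabulary size (~800k) and fp32 storage are assumptions.
vocab_size = 800_000
proj_size = 256                  # embedding / projection width from the table below
bytes_per_float = 4

one_matrix = vocab_size * proj_size * bytes_per_float     # ~0.8 GB of weights
# An untied softmax matrix also carries a same-sized gradient buffer and an
# AdaGrad accumulator, so tying removes roughly three copies of it.
savings = 3 * one_matrix
print(f"~{savings / 2**30:.1f} GiB saved")                # ~2.3 GiB
```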

Hyper-Parameters [3]

| Parameter | Value |
|---|---|
| # Epochs | 5 |
| Training Batch Size | 128 |
| Evaluation Batch Size | 1 |
| BPTT | 20 |
| Embedding Size | 256 |
| Hidden Size | 2048 |
| Projection Size | 256 |
| Tied Embedding + Softmax | False |
| # Layers | 1 |
| Optimizer | AdaGrad |
| Learning Rate | 0.10 |
| Gradient Clipping | 1.00 |
| Dropout | 0.01 |
| Weight Decay (L2 Penalty) | 1e-6 |
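
As an illustration only (not the repository's actual code), the hyper-parameters above roughly describe a model like the sketch below. The vocabulary size, the use of `nn.LSTM`'s `proj_size` argument (available in recent PyTorch releases, not the v0.4.1 required by this repo), and the plain `nn.Linear` decoder are assumptions; the repository itself approximates the full softmax with a sampled softmax built on the Log_Uniform sampler [1][3][4].

```python
import torch
import torch.nn as nn

class GBWLanguageModel(nn.Module):
    """Single-layer projected LSTM language model matching the table above.
    The vocabulary size is an assumed placeholder for the ~800k GBW vocabulary."""

    def __init__(self, vocab_size=800_000, embed_size=256,
                 hidden_size=2048, proj_size=256, dropout=0.01):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # proj_size maps the 2048-unit hidden state back down to 256
        self.rnn = nn.LSTM(embed_size, hidden_size, num_layers=1,
                           proj_size=proj_size, batch_first=True)
        self.drop = nn.Dropout(dropout)
        # Untied decoder (Tied Embedding + Softmax = False above); the repo
        # replaces this full softmax with a sampled softmax during training.
        self.decoder = nn.Linear(proj_size, vocab_size)

    def forward(self, tokens, state=None):
        emb = self.drop(self.embed(tokens))            # (batch, bptt, 256)
        out, state = self.rnn(emb, state)              # (batch, bptt, 256)
        return self.decoder(self.drop(out)), state

model = GBWLanguageModel()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.10, weight_decay=1e-6)
# After each backward pass, gradients would be clipped at 1.00:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.00)
```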

Setup - Torch Data Format

  1. Download Google Billion Word Dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install the Cython framework and build the Log_Uniform sampler (see the sketch after this list)
  4. Convert the Torch data tensors to PyTorch tensor format (requires PyTorch v0.4.1)
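
Step 3 builds the Cython Log_Uniform sampler used for sampled-softmax candidate sampling [1][4]. Purely as an illustration of the distribution it draws from (this is not the repository's Cython implementation), a minimal NumPy sketch:

```python
import numpy as np

def log_uniform_sample(range_max, num_samples, rng=None):
    """Sample candidate word ids from the log-uniform (Zipfian) distribution
    used for candidate sampling [4]:
        P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1)
    assuming word ids are ordered by decreasing frequency."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.random(num_samples)
    # Inverse-CDF sampling: floor(exp(u * log(range_max + 1))) - 1
    values = np.floor(np.exp(u * np.log(range_max + 1))).astype(np.int64) - 1
    return np.minimum(values, range_max - 1)   # guard the u == 1.0 edge case
```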

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all of the words in one data partition, and the "train_data.sid" file marks the start and end positions of each independent sentence. This preprocessing step and the "train_data.sid" file speed up loading of the massive training data.
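
The on-disk layout of "train_data.sid" is not documented on this page; purely as an illustration, the sketch below assumes it stores one (start, end) offset pair per sentence into the flat token tensor, which is all that is needed to slice out a single sentence without scanning the data:

```python
import torch

# Illustrative only -- the file names and the (start, end) layout of the
# sentence-id file are assumptions about the preprocessed GBW data.
tokens = torch.load("train_data.pt")    # flat 1-D tensor of word ids
sids = torch.load("train_data.sid")     # shape (num_sentences, 2)

def get_sentence(i):
    """Return the i-th independent sentence as a slice of the flat tensor."""
    start, end = sids[i].tolist()
    return tokens[start:end]
```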

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
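
A minimal sketch of streaming the original format one shard at a time, so that only a small chunk is ever resident in memory (the directory and file names follow the published LM1B layout, but treat them as assumptions here):

```python
import glob
import os

def iter_sentences(data_dir):
    """Yield one tokenized sentence at a time without loading the whole
    corpus into memory. Shard names are assumed from the LM1B release."""
    pattern = os.path.join(
        data_dir, "training-monolingual.tokenized.shuffled", "news.en-*")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.split()
```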

References

  1. Exploring the Limits of Language Modeling (GitHub)
  2. Factorization Tricks for LSTM Networks (GitHub)
  3. Efficient Softmax Approximation for GPUs (GitHub)
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling