Home

Awesome

minGPT

mingpt

A PyTorch re-implementation of GPT training. minGPT tries to be small, clean, interpretable and educational, as most of the currently available ones are a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code, including boilerplate and a totally unnecessary custom causal self-attention module. Anyway, all that's going on is that a sequence of indices goes into a sequence of transformer blocks, and a probability distribution of the next index comes out. The rest of the complexity is just being clever with batching (both across examples and over sequence length) so that training is efficient.

The core minGPT "library" (hah) is two files: mingpt/model.py contains the actual Transformer model definition and mingpt/trainer.py is (GPT-independent) PyTorch boilerplate that trains the model. The attached Jupyter notebooks then show how the "library" (hah) can be used to train sequence models:

With a bpe encoder, distributed training and maybe fp16 this implementation may be able to reproduce GPT-1/GPT-2 results, though I haven't tried $$$. GPT-3 is likely out of reach as my understanding is that it does not fit into GPU memory and requires a more careful model-parallel treatment.

Example usage

This code is simple enough to just hack inline, not "used", but current API looks something like:


# you're on your own to define a class that returns individual examples as PyTorch LongTensors
from torch.utils.data import Dataset, DataLoader
train_dataset = MyDataset(...)
val_dataset = MyDataset(...)
train_loader = DataLoader(train_dataset)
val_loader = DataLoader(val_dataset)

# construct a GPT model
from mingpt.model import GPT
model = GPT(vocab_size=train_dataset.vocab_size, 
            block_size=train_dataset.block_size,
            n_layer=8, 
            n_head=8, 
            n_embd=512, 
            learning_rate=6e-4)

# construct a trainer
from pytorch_lightning import Trainer
from mingpt.lr_decay import LearningRateDecayCallback

# scheduler
lr_decay = LearningRateDecayCallback(learning_rate=6e-4, warmup_tokens=512*20,
                                    final_tokens=00*len(train_dataset)*block_size)

trainer = Trainer(gpus=1, precision=16, max_epochs=500,
                  gradient_clip_val=1.0, 
                  callbacks=[lr_decay], 
                  progress_bar_refresh_rate=1, 
                  row_log_interval=1)
trainer.fit(model, train_loader, val_loader)
# (... enjoy the show for a while... )

# sample from the model (the [None, ...] and [0] are to push/pop a needed dummy batch dimension)
from mingpt.utils import sample
x = torch.tensor([1, 2, 3], dtype=torch.long)[None, ...] # context conditioning
y = sample(model, x, steps=30, temperature=1.0, sample=True, top_k=5)[0]
print(y) # our model filled in the integer sequence with 30 additional likely integers

References

Code:

Papers + some implementation notes:

Improving Language Understanding by Generative Pre-Training (GPT-1)

Language Models are Unsupervised Multitask Learners (GPT-2)

Language Models are Few-Shot Learners (GPT-3)

License

MIT