Home

Awesome

PyTorch Pretrained Bert

This repository contains an op-for-op PyTorch reimplementation of Google's TensorFlow repository for the BERT model that was released together with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

This implementation is provided with Google's pre-trained models, examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.

Content

SectionDescription
InstallationHow to install the package
OverviewOverview of the package
UsageQuickstart examples
DocDetailed documentation
ExamplesDetailed examples on how to fine-tune Bert
NotebooksIntroduction on the provided Jupyter Notebooks
TPUNotes on TPU support and pretraining scripts
Command-line interfaceConvert a TensorFlow checkpoint in a PyTorch dump

Installation

This repo was tested on Python 3.5+ and PyTorch 0.4.1

With pip

PyTorch pretrained bert can be installed by pip as follows:

pip install pytorch_pretrained_bert

From source

Clone the repository and run:

pip install [--editable] .

A series of tests is included in the tests folder and can be run using pytest (install pytest if needed: pip install pytest).

You can run the tests with the command:

python -m pytest -sv tests/

Overview

This package comprises the following classes that can be imported in Python and are detailed in the Doc section of this readme:

The repository further comprises:

Usage

Here is a quick-start example using BertTokenizer, BertModel and BertForMaskedLM class with Google AI's pre-trained Bert base uncased model. See the doc section below for all the details on these classes.

First let's prepare a tokenized input with BertTokenizer

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
tokenized_text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 6
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['who', 'was', 'jim', 'henson', '?', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

Let's see how to use BertModel to get hidden states

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Predict hidden states features for each layer
encoded_layers, _ = model(tokens_tensor, segments_tensors)
# We have a hidden states for each of the 12 layers in model bert-base-uncased
assert len(encoded_layers) == 12

And how to use BertForMaskedLM

# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Predict all tokens
predictions = model(tokens_tensor, segments_tensors)

# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
assert predicted_token == 'henson'

Doc

Here is a detailed documentation of the classes in the package and how to use them:

Sub-sectionDescription
Loading Google AI's pre-trained weigthsHow to load Google AI's pre-trained weight or a PyTorch saved instance
PyTorch modelsAPI of the six PyTorch model classes: BertModel, BertForMaskedLM, BertForNextSentencePrediction, BertForPreTraining, BertForSequenceClassification or BertForQuestionAnswering
Tokenizer: BertTokenizerAPI of the BertTokenizer class
Optimizer: BertAdamAPI of the BertAdam class

Loading Google AI's pre-trained weigths and PyTorch dump

To load one of Google AI's pre-trained models or a PyTorch saved model (an instance of BertForPreTraining saved with torch.save()), the PyTorch model classes and the tokenizer can be instantiated as

model = BERT_CLASS.from_pretrain(PRE_TRAINED_MODEL_NAME_OR_PATH)

where

If PRE_TRAINED_MODEL_NAME is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links here) and stored in a cache folder to avoid future download (the cache folder can be found at ~/.pytorch_pretrained_bert/).

Example:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

PyTorch models

1. BertModel

BertModel is the basic BERT Transformer model with a layer of summed token, position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 for BERT-large).

The inputs and output are identical to the TensorFlow model inputs and outputs.

We detail them here. This model takes as inputs:

This model outputs a tuple composed of:

An example on how to use this class is given in the extract_features.py script which can be used to extract the hidden states of the model for a given input.

2. BertForPreTraining

BertForPreTraining includes the BertModel Transformer followed by the two pre-training heads:

Inputs comprises the inputs of the BertModel class plus two optional labels:

Outputs:

3. BertForMaskedLM

BertForMaskedLM includes the BertModel Transformer followed by the (possibly) pre-trained masked language modeling head.

Inputs comprises the inputs of the BertModel class plus optional label:

Outputs:

4. BertForNextSentencePrediction

BertForNextSentencePrediction includes the BertModel Transformer followed by the next sentence classification head.

Inputs comprises the inputs of the BertModel class plus an optional label:

Outputs:

5. BertForSequenceClassification

BertForSequenceClassification is a fine-tuning model that includes BertModel and a sequence-level (sequence or pair of sequences) classifier on top of the BertModel.

The sequence-level classifier is a linear layer that takes as input the last hidden state of the first character in the input sequence (see Figures 3a and 3b in the BERT paper).

An example on how to use this class is given in the run_classifier.py script which can be used to fine-tune a single sequence (or pair of sequence) classifier using BERT, for example for the MRPC task.

6. BertForQuestionAnswering

BertForQuestionAnswering is a fine-tuning model that includes BertModel with a token-level classifiers on top of the full sequence of last hidden states.

The token-level classifier takes as input the full sequence of the last hidden state and compute several (e.g. two) scores for each tokens that can for example respectively be the score that a given token is a start_span and a end_span token (see Figures 3c and 3d in the BERT paper).

An example on how to use this class is given in the run_squad.py script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.

Tokenizer: BertTokenizer

BertTokenizer perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

This class has two arguments:

and three methods:

Please refer to the doc strings and code in tokenization.py for the details of the BasicTokenizer and WordpieceTokenizer classes. In general it is recommended to use BertTokenizer unless you know what you are doing.

Optimizer: BertAdam

BertAdam is a torch.optimizer adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with PyTorch Adam optimizer are the following:

The optimizer accepts the following arguments:

Examples

Sub-sectionDescription
Training large models: introduction, tools and examplesHow to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
Fine-tuning with BERT: running the examplesRunning the examples in ./examples: extract_classif.py, run_classifier.py and run_squad.py
Fine-tuning BERT-large on GPUsHow to fine tune BERT large

Training large models: introduction, tools and examples

BERT-base and BERT-large are respectively 110M and 340M parameters models and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most case a batch size of 32).

To help with fine-tuning these models, we have included five techniques that you can activate in the fine-tuning scripts run_classifier.py and run_squad.py: gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training . For more details on how to use these techniques you can read the tips on training large batches in PyTorch that I published earlier this month.

Here is how to use these techniques in our scripts:

Note: To use Distributed Training, you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see the above mentioned blog post for more details):

python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)

Where $THIS_MACHINE_INDEX is an sequential index assigned to each of your machine (0, 1, 2...) and the machine with rank 0 has an IP address 192.168.1.1 and an open port 1234.

Fine-tuning with BERT: running the examples

We showcase the same examples as the original implementation: fine-tuning a sequence-level classifier on the MRPC classification corpus and a token-level classifier on the question answering dataset SQuAD.

Before running these examples you should download the GLUE data by running this script and unpack it to some directory $GLUE_DIR. Please also download the BERT-Base checkpoint, unzip it to some directory $BERT_BASE_DIR, and convert it to its PyTorch version as explained in the previous section.

This example code fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less than 10 minutes on a single K-80.

export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/MRPC/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/

Our test ran on a few seeds with the original implementation hyper-parameters gave evaluation results between 84% and 88%.

The second example fine-tunes BERT-Base on the SQuAD question answering task.

The data for SQuAD can be downloaded with the following links and should be saved in a $SQUAD_DIR directory.

export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_train \
  --do_predict \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

Training with the previous hyper-parameters gave us the following results:

{"f1": 88.52381567990474, "exact_match": 81.22043519394512}

Fine-tuning BERT-large on GPUs

The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.

For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 k-80 (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):

{"exact_match": 84.56953642384106, "f1": 91.04028647786927}

To get these results we used a combination of:

Here is the full list of hyper-parameters for this run:

python ./run_squad.py \
  --vocab_file $BERT_LARGE_DIR/vocab.txt \
  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2 \
  --optimize_on_cpu

If you have a recent GPU (starting from NVIDIA Volta series), you should try 16-bit fine-tuning (FP16).

Here is an example of hyper-parameters for a FP16 run we tried:

python ./run_squad.py \
  --vocab_file $BERT_LARGE_DIR/vocab.txt \
  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
  --fp16 \
  --loss_scale 128

The results were similar to the above FP32 results (actually slightly higher):

{"exact_match": 84.65468306527909, "f1": 91.238669287002}

Notebooks

We include three Jupyter Notebooks that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.

Please follow the instructions given in the notebooks to run and modify them.

Command-line interface

A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch dump of the BertForPreTraining class (see above).

You can convert any TensorFlow checkpoint for BERT (in particular the pre-trained models released by Google) in a PyTorch save file by using the convert_tf_checkpoint_to_pytorch.py script.

This CLI takes as input a TensorFlow checkpoint (three files starting with bert_model.ckpt) and the associated configuration file (bert_config.json), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using torch.load() (see examples in extract_features.py, run_classifier.py and run_squad.py).

You only need to run this conversion script once to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with bert_model.ckpt) but be sure to keep the configuration file (bert_config.json) and the vocabulary file (vocab.txt) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (pip install tensorflow). The rest of the repository only requires PyTorch.

Here is an example of the conversion process for a pre-trained BERT-Base Uncased model:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin

You can download Google's pre-trained models for the conversion here.

TPU

TPU support and pretraining scripts

TPU are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent official announcement).

We will add TPU support when this next release is published.

The original TensorFlow code further comprises two scripts for pre-training BERT: create_pretraining_data.py and run_pretraining.py.

Since, pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amout of time (see details here) we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.