Introduction

XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.

For the technical details and experimental results, please refer to our paper:

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

(*: equal contribution)

Preprint 2019

Release Notes

Results

As of June 19, 2019, XLNet outperforms BERT on 20 tasks and achieves state-of-the-art results on 18 tasks. Below are some comparisons between XLNet-Large and BERT-Large, which have similar model sizes:

Results on Reading Comprehension

| Model | RACE accuracy | SQuAD1.1 EM | SQuAD2.0 EM |
| --- | --- | --- | --- |
| BERT-Large | 72.0 | 84.1 | 78.98 |
| XLNet-Base | | | 80.18 |
| XLNet-Large | 81.75 | 88.95 | 86.12 |

We use SQuAD dev results in the table to exclude other factors such as additional training data or other data augmentation techniques. See the SQuAD leaderboard for test numbers.

Results on Text Classification

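| Model | IMDB | Yelp-2 | Yelp-5 | DBpedia | Amazon-2 | Amazon-5 |
| --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | 4.51 | 1.89 | 29.32 | 0.64 | 2.63 | 34.17 |
| XLNet-Large | 3.79 | 1.55 | 27.80 | 0.62 | 2.40 | 32.26 |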

The above numbers are error rates.
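
Because these are error rates, the gap between the two models is often easier to read as a relative error reduction. The sketch below only rearranges the numbers from the table; the helper name `relative_error_reduction` is introduced here for illustration and is not part of the code base.

```python
# Error rates (%) copied from the text classification table above.
BERT_LARGE = {"IMDB": 4.51, "Yelp-2": 1.89, "Yelp-5": 29.32,
              "DBpedia": 0.64, "Amazon-2": 2.63, "Amazon-5": 34.17}
XLNET_LARGE = {"IMDB": 3.79, "Yelp-2": 1.55, "Yelp-5": 27.80,
               "DBpedia": 0.62, "Amazon-2": 2.40, "Amazon-5": 32.26}


def relative_error_reduction(baseline, improved):
    """Return the percentage of the baseline error removed by the improved model."""
    return 100.0 * (baseline - improved) / baseline


# Relative error reduction of XLNet-Large over BERT-Large, per dataset.
reductions = {
    task: relative_error_reduction(BERT_LARGE[task], XLNET_LARGE[task])
    for task in BERT_LARGE
}

for task in sorted(reductions):
    print("{:<9} {:5.2f}%".format(task, reductions[task]))
```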

Results on GLUE

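| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-Base | 86.8 | 91.7 | 91.4 | 74.0 | 94.7 | 88.2 | 60.2 | 89.5 |
| XLNet-Large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |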

We use single-task dev results in the table to exclude other factors such as multi-task learning or using ensembles.

Pre-trained models

Released Models

As of <u>July 16, 2019</u>, the following models have been made available:

We only release cased models for now because on the tasks we consider, we found: (1) for the base setting, cased and uncased models have similar performance; (2) for the large setting, cased models are a bit better in some tasks.

Each .zip file contains three items:

Future Release Plan

We also plan to continuously release more pretrained models under different settings, including:

Subscribing to XLNet on Google Groups

To receive notifications about updates, announcements and new releases, we recommend subscribing to XLNet on Google Groups.

Fine-tuning with XLNet

As of <u>June 19, 2019</u>, this code base has been tested with TensorFlow 1.13.1 under Python 2.

Memory Issue during Finetuning

Given the memory issue mentioned above, using the default finetuning scripts (run_classifier.py and run_squad.py), we benchmarked the maximum batch size on a single 16GB GPU with TensorFlow 1.13.1:

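| System | Seq Length | Max Batch Size |
| --- | --- | --- |
| XLNet-Base | 64 | 120 |
| ... | 128 | 56 |
| ... | 256 | 24 |
| ... | 512 | 8 |
| XLNet-Large | 64 | 16 |
| ... | 128 | 8 |
| ... | 256 | 2 |
| ... | 512 | 1 |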

In most cases, it is possible to reduce the batch size (train_batch_size) or the maximum sequence length (max_seq_length) to fit the available hardware. The resulting decrease in performance depends on the task and the available resources.
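
As a rough aid for choosing these two settings, the benchmark table can be encoded as a lookup. This is a minimal sketch: the helper `largest_fitting_seq_len` is hypothetical, and the numbers hold only for the benchmarked setup (a single 16GB GPU, TensorFlow 1.13.1, default finetuning scripts).

```python
# Maximum batch sizes copied from the benchmark table above.
MAX_BATCH_SIZE = {
    "XLNet-Base": {64: 120, 128: 56, 256: 24, 512: 8},
    "XLNet-Large": {64: 16, 128: 8, 256: 2, 512: 1},
}


def largest_fitting_seq_len(model, batch_size):
    """Return the longest benchmarked sequence length that fits `batch_size`.

    Returns None when no benchmarked setting can hold the requested batch.
    """
    fitting = [seq_len for seq_len, max_bsz in MAX_BATCH_SIZE[model].items()
               if batch_size <= max_bsz]
    return max(fitting) if fitting else None
```

For instance, `largest_fitting_seq_len("XLNet-Large", 8)` selects sequence length 128, the longest row in the table whose maximum batch size is at least 8.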

Text Classification/Regression

The code used to perform classification/regression finetuning is in run_classifier.py. It also contains examples for standard one-document classification, one-document regression, and document pair classification. Here, we provide two concrete examples of how run_classifier.py can be used.

From here on, we assume XLNet-Large and XLNet-Base have been downloaded to $LARGE_DIR and $BASE_DIR respectively.

(1) STS-B: sentence pair relevance regression (with GPUs)

Notes:

(2) IMDB: movie review sentiment classification (with TPU V3-8)

Notes:

SQuAD2.0

The code for the SQuAD dataset is included in run_squad.py.

To run the code:

(1) Download the SQuAD2.0 dataset into $SQUAD_DIR by:

mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

(2) Perform data preprocessing using the script scripts/prepro_squad.sh.

(3) Perform training and evaluation.

For the best performance, XLNet-Large uses <u>sequence length 512</u> and <u>batch size 48</u> for training.

Alternatively, one can use XLNet-Base with GPUs (e.g. three V100s). One set of reasonable hyper-parameters can be found in the script scripts/gpu_squad_base.sh.
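
Before preprocessing, it can be worth checking that the files downloaded in step (1) parse correctly. The sketch below is not part of the repository; it assumes the public SQuAD2.0 JSON layout (`data` → `paragraphs` → `qas`, with an `is_impossible` flag on each question), and the function names are hypothetical.

```python
import io
import json


def count_squad_questions(dataset):
    """Count answerable and unanswerable questions in a SQuAD2.0-style dict."""
    answerable, unanswerable = 0, 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD2.0 marks unanswerable questions with `is_impossible`.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


def count_squad_file(path):
    """Load a SQuAD2.0 JSON file from `path` and count its questions."""
    with io.open(path, encoding="utf-8") as f:
        return count_squad_questions(json.load(f))
```

Calling `count_squad_file` on `train-v2.0.json` and `dev-v2.0.json` in `$SQUAD_DIR` should return two non-zero counts for each file; an exception or a zero count suggests a truncated download.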

RACE reading comprehension

The code for the reading comprehension task RACE is included in run_race.py.

To run the code:

(1) Download the RACE dataset from the official website and unpack the raw data to $RACE_DIR.

(2) Perform training and evaluation:

Using Google Colab

An example of using Google Colab with GPUs has been provided. Note that since the hardware is constrained in the example, the results are worse than the best we can get. It mainly serves as an example and should be modified accordingly to maximize performance.

Custom Usage of XLNet

XLNet Abstraction

For finetuning, it is likely that you will be able to modify existing files such as run_classifier.py, run_squad.py and run_race.py for your task at hand. However, we also provide an abstraction of XLNet to enable more flexible usage. Below is an example:

import xlnet

# some code omitted here...
# initialize FLAGS
# initialize instances of tf.Tensor, including input_ids, seg_ids, and input_mask

# XLNetConfig contains hyperparameters that are specific to a model checkpoint.
xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path)

# RunConfig contains hyperparameters that could be different between pretraining and finetuning.
run_config = xlnet.create_run_config(is_training=True, is_finetune=True, FLAGS=FLAGS)

# Construct an XLNet model
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)

# Get a summary of the sequence using the last hidden state
summary = xlnet_model.get_pooled_out(summary_type="last")

# Get a sequence output
seq_out = xlnet_model.get_sequence_output()

# build your applications based on `summary` or `seq_out`

Tokenization

Below is an example of doing tokenization in XLNet:

import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids

# some code omitted here...
# initialize FLAGS

text = "An input text string."

sp_model = spm.SentencePieceProcessor()
sp_model.Load(FLAGS.spiece_model_file)
text = preprocess_text(text, lower=FLAGS.uncased)
ids = encode_ids(sp_model, text)

where FLAGS.spiece_model_file is the SentencePiece model file in the same zip as the pretrained model, and FLAGS.uncased is a bool indicating whether to lowercase the text.
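
The resulting `ids` still have to be brought to a fixed length before they can serve as `input_ids`, together with a matching `input_mask`. The sketch below is an illustration, not the repository's code: `pad_for_xlnet` is a hypothetical helper, and it assumes left-padding with a mask that uses 0 for real tokens and 1 for padding. It also leaves out the special tokens that the finetuning scripts append, so the convention should be checked against those scripts before relying on it.

```python
def pad_for_xlnet(ids, max_seq_length, pad_id=0):
    """Left-pad `ids` to `max_seq_length` and build the matching input mask.

    Assumed convention: in the mask, 0 marks a real token and 1 marks padding.
    """
    ids = list(ids)[:max_seq_length]  # naive truncation, for illustration only
    num_pad = max_seq_length - len(ids)
    input_ids = [pad_id] * num_pad + ids
    input_mask = [1] * num_pad + [0] * len(ids)
    return input_ids, input_mask
```

With `ids = [10, 11, 12]` and `max_seq_length=5`, the helper places two padding positions in front of the three real tokens and marks only those two positions with 1 in the mask.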

Pretraining with XLNet

Refer to train.py for pretraining on TPUs and train_gpu.py for pretraining on GPUs. First we need to preprocess the text data into tfrecords.

python data_utils.py \
	--bsz_per_host=32 \
	--num_core_per_host=16 \
	--seq_len=512 \
	--reuse_len=256 \
	--input_glob=*.txt \
	--save_dir=${SAVE_DIR} \
	--num_passes=20 \
	--bi_data=True \
	--sp_path=spiece.model \
	--mask_alpha=6 \
	--mask_beta=1 \
	--num_predict=85

where input_glob defines all input text files, save_dir is the output directory for tfrecords, and sp_path is a SentencePiece model. Here is our script to train the SentencePiece model:

spm_train \
	--input=$INPUT \
	--model_prefix=sp10m.cased.v3 \
	--vocab_size=32000 \
	--character_coverage=0.99995 \
	--model_type=unigram \
	--control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
	--user_defined_symbols='<eop>,.,(,),",-,–,£,€' \
	--shuffle_input_sentence \
	--input_sentence_size=10000000

Special symbols are used, including control_symbols and user_defined_symbols. We use <eop> and <eod> to denote End of Paragraph and End of Document respectively.
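
The symbol lists contain characters such as `<`, `(` and `"` that a shell would otherwise interpret, so one option is to launch `spm_train` from Python with an argument list, which bypasses shell parsing entirely. This is a sketch under assumptions: the helper names and the corpus path are hypothetical, and `spm_train` is assumed to be on the PATH.

```python
import subprocess


def build_spm_train_args(input_path, model_prefix="sp10m.cased.v3"):
    """Return the spm_train command above as an argument list (no shell parsing)."""
    return [
        "spm_train",
        "--input=" + input_path,
        "--model_prefix=" + model_prefix,
        "--vocab_size=32000",
        "--character_coverage=0.99995",
        "--model_type=unigram",
        "--control_symbols=<cls>,<sep>,<pad>,<mask>,<eod>",
        # \u2013 is the en dash, \u00a3 the pound sign, \u20ac the euro sign.
        u"--user_defined_symbols=<eop>,.,(,),\",-,\u2013,\u00a3,\u20ac",
        "--shuffle_input_sentence",
        "--input_sentence_size=10000000",
    ]


def run_spm_train(input_path):
    """Run spm_train on `input_path`; raises CalledProcessError on failure."""
    subprocess.check_call(build_spm_train_args(input_path))
```

Calling `run_spm_train("corpus.txt")` (with a real corpus path in place of the placeholder) passes every flag to `spm_train` exactly as written, without any quoting.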

The input text files to data_utils.py must use the following format:

For example, the text input file could be:

This is the first sentence.
This is the second sentence and also the end of the paragraph.<eop>
Another paragraph.

Another document starts here.
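
Judging from the example above, each line holds one sentence, `<eop>` is appended directly to a sentence that ends a paragraph, and an empty line separates documents. A minimal sketch of a writer for this layout follows; `format_corpus` is a hypothetical helper, not part of data_utils.py, and unlike the example it marks the end of every paragraph unless `mark_paragraphs` is disabled.

```python
def format_corpus(documents, mark_paragraphs=True):
    """Render documents as one sentence per line, blank line between documents.

    `documents` is a list of documents; each document is a list of paragraphs;
    each paragraph is a list of sentence strings.
    """
    rendered_docs = []
    for paragraphs in documents:
        lines = []
        for sentences in paragraphs:
            paragraph_lines = list(sentences)
            if mark_paragraphs and paragraph_lines:
                # Append <eop> directly (no space) to the paragraph's last sentence.
                paragraph_lines[-1] += "<eop>"
            lines.extend(paragraph_lines)
        rendered_docs.append("\n".join(lines))
    # An empty line separates consecutive documents.
    return "\n\n".join(rendered_docs) + "\n"
```

The returned string can be written to a `.txt` file that matches the `--input_glob` pattern passed to data_utils.py.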

After preprocessing, we are ready to pretrain an XLNet. Below are the hyperparameters used for pretraining XLNet-Large:

python train.py \
  --record_info_dir=$DATA/tfrecords \
  --train_batch_size=2048 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

Only the most important flags are listed here; the remaining flags can be adjusted for specific use cases.