span-selection-pretraining

Code to create pre-training data for a span selection pre-training task, inspired by reading comprehension, as part of an effort to avoid encoding general knowledge in the transformer network itself.

Pre-trained Models

Available through Hugging Face as:

Load with AutoConfig.from_pretrained, AutoTokenizer.from_pretrained, and AutoModelForQuestionAnswering.from_pretrained. See run_qa.py for example code.
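
A minimal sketch of loading one of the released checkpoints and asking a question, assuming a recent transformers version; the model identifier below is a placeholder for one of the Hugging Face model names above.

from transformers import AutoConfig, AutoTokenizer, AutoModelForQuestionAnswering

# Placeholder identifier: substitute the actual Hugging Face model name.
model_name = "your-org/your-sspt-model"

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, config=config)

# Score a question against a passage, SQuAD-style.
question = "Who designed the Eiffel Tower?"
passage = "The Eiffel Tower was designed by the company of Gustave Eiffel for the 1889 World's Fair."
inputs = tokenizer(question, passage, return_tensors="pt")
outputs = model(**inputs)

# The highest-scoring start/end positions delimit the predicted answer span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))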

Installation

Data Generation

python WikiExtractor.py --json --filter_disambig_pages --processes 32 --output wikiextracteddir enwiki-20190801-pages-articles-multistream.xml.bz2
python create_passages.py --wikiextracted wikiextracteddir --output wikipassagesdir
java -cp irsimple.jar com.ibm.research.ai.irsimple.MakeIndex wikipassagesdir wikipassagesindex
nohup bash sspt_gen.sh ssptGen wikipassagesdir > querygen.log 2>&1 &
nohup java -cp irsimple.jar com.ibm.research.ai.irsimple.AsyncWriter \
  ssptGen \
  wikipassagesindex > instgen.log 2>&1 &
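
The pipeline above extracts plain text from the Wikipedia dump, splits it into passages, indexes those passages, generates queries, and finally writes training instances. As a rough illustration only (not the actual implementation), each instance pairs a query in which an answer term has been blanked out with a passage from a different document that contains that term:

# Toy stand-in for the passage index built above: (document id, passage text)
# pairs scanned linearly instead of queried through a real search index.
passages = [
    ("doc1", "The Eiffel Tower was completed in 1889 for the World's Fair in Paris."),
    ("doc2", "Gustave Eiffel's company designed and built the tower, a landmark of Paris."),
]

def make_instance(sentence, doc_id, answer_term):
    # Blank the answer term out of the sentence to form the query, then pick a
    # passage from a different document that still contains the term.
    query = sentence.replace(answer_term, "[BLANK]")
    for other_id, passage in passages:
        if other_id != doc_id and answer_term in passage:
            return {"query": query, "passage": passage, "answer": answer_term}
    return None

print(make_instance(
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "doc1",
    "Paris",
))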

Training

FIXME: rc_data and span_selection_pretraining require a modified version of pytorch-transformers. The necessary adaptations are being worked into this repo and into a pull request for pytorch-transformers. Hopefully it is relatively clear how it should work.

python span_selection_pretraining.py \
  --bert_model bert-base-uncased \
  --train_dir ssptGen \
  --num_instances 1000000 \
  --save_model rc_1M_base.bin
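
The script writes the pre-trained weights to rc_1M_base.bin. A minimal sketch of picking them up for downstream use, assuming the file is an ordinary PyTorch state dict for a BERT question answering model (check span_selection_pretraining.py for the exact save format):

import torch
from transformers import BertForQuestionAnswering

# Assumption: rc_1M_base.bin is a plain PyTorch state dict compatible with
# BertForQuestionAnswering; strict=False reports any mismatched keys.
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
state_dict = torch.load("rc_1M_base.bin", map_location="cpu")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)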