Awesome
BART version of closed-book QA
This is a BART version of sequence-to-sequence model for open-domain QA in a closed-book setup, based on PyTorch and Huggingface's Transformers.
The model is a sequence-to-sequence model that takes a question as an input and outputs the answer, without reading any external resource (e.g. passages). Please refer to Roberts et al., 2020, How Much Knowledge Can You Pack Into the Parameters of a Language Model? to learn more about closed-book QA setup and the original model based on T5. Their code and model checkpoints are available here.
The model is based on BART-large. Please refer to Lewis et al., ACL 2020, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension to learn more about BART.
We experiment with Natural Questions open-domain data (NQ-open), but the code should work on any QA data with question-answer pairs.
Requirement
This code is tested on Python 3.6.9.
Install PyTorch and Transformers:
pip install torch==1.1.0
pip install git+https://github.com/huggingface/transformers.git@7b75aa9fa55bee577e2c7403301ed31103125a35
Download NQ-open data:
chmod +x download_data.sh; ./download_data.sh
Training
python cli.py --do_train --output_dir out/nq-bart-closed-qa \
--train_file data/nqopen-train.json \
--predict_file data/nqopen-dev.json \
--train_batch_size ${train_bs} \
--predict_batch_size ${test_bs} \
--append_another_bos
The script will save the log and the best checkpoint inside out/nq-bart-closed-qa
.
Other useful commands (please refer to cli.py
for the full list):
eval_period
: interval to evaluate on the dev dataverbose
: print a progress bardebug
: train and evaluate on a subset of the dev data for debugging purposes
You can use train_batch_size
and predict_batch_size
depending on the gpu availability. With one 16GB gpu, you can use train_batch_size=64, predict_batch_size=64
.
Our model that we reports the result below was trained with train_batch_size=1024, predict_batch_size 256
using eight 32GB gpus. Training took roughly 34 hours.
Note:
- This script saves the pre-tokenized data in
data/
once question-answer pairs are tokenized for the first time. - The model gives the best result when prepending extra BOS token (
--append_another_bos
). - Inference on multi-gpus is not working for now; we will update the code once it is fixed.
Inference
python cli.py --do_predict --output_dir out/nq-bart-closed-qa \
--predict_file data/nqopen-dev.json \
--predict_batch_size ${test_bs} \
--append_another_bos --prefix dev_
python cli.py --do_predict --output_dir out/nq-bart-closed-qa \
--predict_file data/nqopen-test.json \
--predict_batch_size ${test_bs} \
--append_another_bos --prefix test_
It will save the prediction file as out/nq-bart-closed-qa/{dev|test}_predictions.json
.
Result
The final Exact Match score we get is 25.05 on the dev data and 24.10 on the test data.
We made the best model checkpoint and the predictions on the dev/test data available.
Note that T5-based model gets 27.0, 29.8, 32.1 and 34.5 on the test set with Base, Large, 3B and 11B, respectively, based on the original paper. Several factors could lead to the performance gaps: (i) T5 has a larger number of parameters and trained on a larger set of data and (ii) the original paper includes the dev data for training, whereas our codebase only trains the model on the train data and uses the dev data for choosing the best checkpoint. We also did not perform any hyperparamter tuning, as our goal is to provide the basic codebase rather than to achieve the best possible performance; we leave it for the future work.
Note: that the original paper includes ablations that exclude supervised data for T5 pretraining, and reports comparable (or better) numbers: see Appendix C of the original paper for the details!
Contact
Please email Sewon Min or write a Github issue for any question.