
CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks


Data construction

Construct the KILT++ benchmark from the Wikipedia knowledge source and the original KILT dataset.

python utils/construct_dataset.py
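
For intuition, here is a minimal sketch of the session split, assuming KILT++ partitions the knowledge source into five incremental corpora D0 to D4 (the file name, document order, and equal-sized chunking below are illustrative assumptions, not the script's actual logic).

import json

def split_sessions(docs, n_sessions=5):
    # Partition the document list into equal-sized incremental sessions.
    size = (len(docs) + n_sessions - 1) // n_sessions
    return {f"D{i}": docs[i * size:(i + 1) * size] for i in range(n_sessions)}

# Hypothetical file: the KILT knowledge source in JSON-lines format.
with open("kilt_knowledgesource.json") as f:
    docs = [json.loads(line) for line in f]
sessions = split_sessions(docs)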

Construct prefix tree

Construct a prefix tree for each of the incremental corpora D0 to D4.

python utils/construct_trie.py
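
For intuition, here is a minimal sketch of such a prefix tree over tokenized document identifiers (the class and token ids are illustrative, not the repository's code): every path from the root spells a valid identifier, so at decoding time only the children of the current prefix are legal continuations.

class Trie:
    def __init__(self, sequences=()):
        self.children = {}
        for seq in sequences:
            self.add(seq)

    def add(self, token_ids):
        # Insert one tokenized identifier into the nested-dict trie.
        node = self.children
        for tok in token_ids:
            node = node.setdefault(tok, {})

    def next_tokens(self, prefix):
        # Return the token ids that may legally follow the given prefix.
        node = self.children
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())

# Example: two identifiers share the prefix [0, 5], so both continuations remain legal.
trie = Trie([[0, 5, 7, 2], [0, 5, 9, 2], [0, 8, 2]])
assert trie.next_tokens([0, 5]) == [7, 9]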

Train backbone model

Train the backbone model in the initial phase with D0 and R0.

For this procedure, please refer to the CorpusBrain repository.

Note that the backbone is trained with fairseq for efficiency. Use the following script to convert the fairseq checkpoint to the Hugging Face format:

python utils/convert_fairseq_huggingface.py [fairseq_path]
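
A quick way to sanity-check the conversion is to load the result with transformers; the output directory name below is hypothetical.

from transformers import BartForConditionalGeneration

# Path to the converted checkpoint directory (hypothetical name).
model = BartForConditionalGeneration.from_pretrained("checkpoints/hf_backbone")
print(sum(p.numel() for p in model.parameters()), "parameters loaded")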

Revisit old documents

python replay/kmeans.py
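
The script name suggests exemplar selection for replay via k-means over representations of old documents; here is a minimal sketch of that idea (the embedding source, cluster count, and nearest-to-centroid selection rule are all assumptions).

import numpy as np
from sklearn.cluster import KMeans

def select_replay_docs(doc_embeddings: np.ndarray, n_clusters: int = 100):
    # Cluster old-document embeddings and keep the document nearest to
    # each centroid as the replay exemplar for that cluster.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(doc_embeddings)
    exemplars = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(doc_embeddings[members] - km.cluster_centers_[c], axis=1)
        exemplars.append(members[np.argmin(dists)])
    return exemplars  # indices of documents to revisit during pre-training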

Pre-training tasks

Generate query-document pairs for each specific task.

python tasks/[specific_task]/generate.py
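
For illustration, here is a minimal sketch in the spirit of CorpusBrain's inner-sentence-selection pre-training task: sentences sampled from a document body serve as pseudo-queries, each paired with the document identifier (its title). The sentence splitter and sampling rate are assumptions, not the repository's implementation.

import random
import re

def iss_pairs(title: str, body: str, n_samples: int = 3):
    # Yield (pseudo_query, target_identifier) training pairs.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]
    for sent in random.sample(sentences, min(n_samples, len(sentences))):
        yield sent, title

pairs = list(iss_pairs(
    "Charles Darwin",
    "Charles Darwin was an English naturalist. He proposed natural selection. "
    "On the Origin of Species was published in 1859.",
))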

Continual learning

Continually pre-train the adapters with the backbone parameters frozen.

python train_adapter.py --task [task] --batch_size [batch_size] --config_file [config_file] --save_name [save_name] --lr [learning_rate] --max_steps [max_steps] --grad_acc [grad_acc] --eval_steps [eval_steps] --load_adapter_path [load_adapter_path]
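
Conceptually, this stage inserts small bottleneck adapters and updates only their parameters; a minimal PyTorch sketch follows (module size, naming, and placement are illustrative assumptions, not the repository's implementation).

import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen backbone's behaviour intact
        # when the adapter is near-zero initialised.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def freeze_backbone(model: nn.Module):
    # Freeze every backbone parameter; only adapter modules receive gradients
    # (assumes adapter submodules carry "adapter" in their attribute name).
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name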

Evaluation

bash scripts/eval_all.sh
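
At evaluation time, the prefix trees built earlier constrain beam search to valid document identifiers. Here is a minimal sketch using the standard transformers hook prefix_allowed_tokens_fn; the checkpoint path and titles are illustrative, and the BART-specific handling of the decoder start token is an assumption.

from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("checkpoints/hf_backbone")
# The converted directory may not ship tokenizer files, so load the base one.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

def build_trie(sequences):
    # Nested-dict trie over token-id sequences (see the earlier sketch).
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

titles = ["Charles Darwin", "On the Origin of Species"]
trie = build_trie([tokenizer(t).input_ids for t in titles])

def allowed_tokens(batch_id, input_ids):
    # Skip the decoder start token, then walk the trie along the prefix;
    # fall back to EOS when the prefix leaves the trie or completes a title.
    node = trie
    for tok in input_ids.tolist()[1:]:
        node = node.get(tok)
        if node is None:
            return [tokenizer.eos_token_id]
    return list(node.keys()) or [tokenizer.eos_token_id]

inputs = tokenizer("who wrote the origin of species", return_tensors="pt")
ids = model.generate(**inputs, num_beams=2, prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.batch_decode(ids, skip_special_tokens=True))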

Citation

@article{guo2024corpusbrain++,
  title={CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks},
  author={Guo, Jiafeng and Zhou, Changjiang and Zhang, Ruqing and Chen, Jiangui and de Rijke, Maarten and Fan, Yixing and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2402.16767},
  year={2024}
}