# DeepStruct: Pretraining of Language Models for Structure Prediction

Source code for the paper *DeepStruct: Pretraining of Language Models for Structure Prediction*, ACL 2022.
## Setup Environment

DeepStruct builds on GLM. Please use GLM's docker images as follows to set up the basic GPU environment (`zxdu20/glm-cuda112` for Ampere GPUs, `zxdu20/glm-cuda102` for older GPUs such as Tesla V100).
```bash
git clone --recursive git@github.com:cgraywang/deepstruct.git
cd ./deepstruct
docker run --net=host --privileged --pid=host --gpus all --rm -it --ipc=host -v "$(pwd)":/workspace/deepstruct zxdu20/glm-cuda112
cd /workspace/deepstruct
```
Then install the dependencies via `setup.sh`:

```bash
bash setup.sh
```
The final directory structure should be as follows:

```
workspace/
├─ deepstruct/
├─ data/
├─ ckpt/
```
## Download Checkpoints

Most of our experiments are based on the 10-billion-parameter DeepStruct checkpoint. Run the following shell script to download all multi-task trained DeepStruct checkpoints from the Hugging Face Hub (this might take a while).

```bash
bash download_ckpt.sh
```
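If the download is interrupted or you prefer to fetch a checkpoint by hand, the repository can also be cloned from the Hugging Face Hub with git-lfs. This is only a sketch: the repository id below is a placeholder, so substitute the one referenced in `download_ckpt.sh`, and make sure the files end up under `ckpt/` as in the layout above.

```bash
# Placeholder repo id -- use the one referenced in download_ckpt.sh.
git lfs install
git clone https://huggingface.co/<DEEPSTRUCT_CHECKPOINT_REPO> ../ckpt
```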
## Data Preparation & Reproduction
To run the following experiments on DeepStruct-10B, our experiments use `batch_size_per_gpu=1` and require at least 32 GB of GPU memory. The scripts in `src/tasks/mt/*.sh` use `--num-gpus-per-node=1` by default; if you want to use multiple GPUs for acceleration, customize this flag in `src/tasks/mt/*.sh`, for example with the one-line edit sketched below.
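A minimal sketch of that edit, assuming the flag appears verbatim in the scripts (the target count of 4 GPUs is an arbitrary choice):

```bash
# Switch every multi-task script from 1 to 4 GPUs per node; adjust to your machine.
sed -i 's/--num-gpus-per-node=1/--num-gpus-per-node=4/' src/tasks/mt/*.sh
```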
Notice that CoNLL12 and CoNLL05 for semantic role labeling and ACE2005 for event extraction require manual downloads from the LDC (LDC2013T19, PTB-3, and LDC2006T06, respectively).
| Task | Dataset | Data preparation | Multi-task Result |
|---|---|---|---|
| Joint entity and relation extraction | CoNLL04 | `bash run_scripts/conll04.sh` | Ent. 88.4 / Rel. 72.8 |
| Joint entity and relation extraction | ADE | `bash run_scripts/ade.sh` | Ent. 90.5 / Rel. 83.6 |
| Joint entity and relation extraction | NYT | `bash run_scripts/nyt.sh` | Ent. 95.4 / Rel. 93.7 |
| Joint entity and relation extraction | ACE2005 | `bash run_scripts/ace2005_jer.sh <abs_path_to_LDC2006T06>` | Ent. 90.2 / Rel. 58.9 |
| Semantic role labeling | CoNLL05 WSJ | `bash run_scripts/conll05_srl_wsj.sh <abs_path_to_PTB_3>` | 95.5 |
| Semantic role labeling | CoNLL05 Brown | `bash run_scripts/conll05_srl_brown.sh <abs_path_to_PTB_3>` | 92.0 |
| Semantic role labeling | CoNLL12 | `bash run_scripts/conll12_srl.sh <abs_path_to_LDC2013T19>` | 97.2 |
| Event extraction | ACE2005 | `bash run_scripts/ace2005event.sh <abs_path_to_LDC2006T06>` | Trigger: Id 72.7 / Cl 69.2; Argument: Id 67.5 / Cl 63.9 |
| Intent detection | ATIS | `bash run_scripts/atis.sh` | 97.3 |
| Intent detection | SNIPS | `bash run_scripts/snips.sh` | 97.4 |
| Dialogue state tracking | MultiWOZ 2.1 | `bash run_scripts/multi_woz.sh` | 53.5 |
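For instance, to reproduce the ACE2005 joint entity and relation extraction row, point the script at your local copy of the LDC corpus (the path below is illustrative):

```bash
# Replace the path with the location of your LDC2006T06 copy.
bash run_scripts/ace2005_jer.sh /data/corpora/LDC2006T06
```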
## Arguments in Running Scripts

The arguments in `src/tasks/mt/*.sh` configure the training and inference of DeepStruct. Here are their meanings (a combined usage sketch follows the list):
- `--model-type`: the type of model backbone to use. Currently only `model_blocklm_10B` is supported, which uses the 10-billion-parameter DeepStruct model as the backbone.
- `--model-checkpoint`: the path to the directory containing the DeepStruct checkpoint.
- `--task`: the task to train on or evaluate.
- `--task-epochs`: the number of epochs to run. If set to `0`, the script runs evaluation only.
- `--length-penalty`: a hyperparameter that controls the length of generated sequences during beam search.
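As a rough illustration of how these flags fit together, the sketch below collects them for an evaluation-only run; the checkpoint path, task name, and length-penalty value are assumptions rather than values taken from the released scripts.

```bash
# Illustrative only: flag names come from the list above, values are assumptions.
DEEPSTRUCT_ARGS="--model-type model_blocklm_10B \
                 --model-checkpoint /workspace/ckpt \
                 --task conll04 \
                 --task-epochs 0 \
                 --length-penalty 0.8"
```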
## Scripts for Pretraining

Follow the commands below to prepare the pretraining data and run training.
```bash
# prepare pretraining data
bash data_scripts/PRETRAIN.sh

# run pretraining
cd ./glm/
bash scripts/ds_finetune_seq2seq_pretrain.sh config_tasks/<MODEL_TYPE>.sh config_tasks/pretrain.sh cnn_dm_original
```
Currently, `<MODEL_TYPE>` only supports `model_blocklm_10B_pretrain`, which uses the 10-billion-parameter pretrained model as the backbone.
Please customize `NUM_GPUS_PER_WORKER` in `glm/scripts/ds_finetune_seq2seq_pretrain.sh` and `train_micro_batch_size_per_gpu` in `glm/config_tasks/config.json` according to your environment, as fine-tuning a 10B-parameter language model requires a substantial amount of GPU memory.
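A sketch of such a customization, run from the repository root (the values are assumptions; the edits also assume each setting appears on a single line in the named files):

```bash
# Use 8 GPUs per worker (assumes NUM_GPUS_PER_WORKER is assigned literally in the script).
sed -i 's/^NUM_GPUS_PER_WORKER=.*/NUM_GPUS_PER_WORKER=8/' glm/scripts/ds_finetune_seq2seq_pretrain.sh

# Lower the DeepSpeed micro batch size to fit smaller GPUs.
sed -i 's/"train_micro_batch_size_per_gpu": *[0-9]*/"train_micro_batch_size_per_gpu": 1/' glm/config_tasks/config.json
```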
The data preprocessing for pretraining may require over 600 GB of main memory, as the current dataloader implementation preloads all tokenized data into main memory during pretraining.
## Citation

```bibtex
@inproceedings{wang-etal-2022-deepstruct,
    title = "{D}eep{S}truct: Pretraining of Language Models for Structure Prediction",
    author = "Wang, Chenguang and
      Liu, Xiao and
      Chen, Zui and
      Hong, Haoyun and
      Tang, Jie and
      Song, Dawn",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    year = "2022",
}
```