DeepStruct: Pretraining of Language Models for Structure Prediction

Source code repository for the paper DeepStruct: Pretraining of Language Models for Structure Prediction (Findings of ACL 2022).

Setup Environment

DeepStruct is built on GLM. Please use GLM's Docker images as follows to set up the basic GPU environment (zxdu20/glm-cuda112 for Ampere GPUs, zxdu20/glm-cuda102 for older GPUs such as the Tesla V100).

git clone --recursive git@github.com:cgraywang/deepstruct.git
cd ./deepstruct

docker run --net=host --privileged --pid=host --gpus all --rm -it --ipc=host -v "$(pwd)":/workspace/deepstruct zxdu20/glm-cuda112
cd /workspace/deepstruct
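
Once inside the container, you can optionally confirm that the GPUs are visible:

# sanity check: the attached GPUs should be listed
nvidia-smi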

and install the dependencies via setup.sh:

bash setup.sh

The final directory structure should be as follows:

workspace/
├─ deepstruct/
├─ data/
├─ ckpt/

Download Checkpoints

Most of our experiments are based on the 10-billion-parameter DeepStruct checkpoint. Run the following shell script to download all multi-task trained DeepStruct checkpoints from the Hugging Face Hub (this may take a while).

bash download_ckpt.sh
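
If you prefer to fetch checkpoints by hand, the Hugging Face Hub client offers an equivalent route. A minimal sketch, assuming the checkpoints are hosted as Hub model repos; <hub-repo-id> is a placeholder, see download_ckpt.sh for the actual repositories:

# sketch only: substitute a repo id listed in download_ckpt.sh for <hub-repo-id>
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='<hub-repo-id>', local_dir='../ckpt')"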

Data Preparation & Reproduce

The following experiments on DeepStruct-10B use batch_size_per_gpu=1 and require at least 32 GB of GPU memory. The scripts default to --num-gpus-per-node=1 in src/tasks/mt/*.sh; if you want to use multiple GPUs for acceleration, customize that value in src/tasks/mt/*.sh.
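
For example, a quick in-place edit (illustrative only; the exact flag layout inside src/tasks/mt/*.sh may differ slightly):

# bump every task script from 1 GPU to 4 GPUs; adjust to your hardware
sed -i 's/--num-gpus-per-node=1/--num-gpus-per-node=4/' src/tasks/mt/*.sh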

Note that CoNLL12 and CoNLL05 for semantic role labeling and ACE2005 for joint entity and relation extraction and event extraction require manual downloads from the LDC (LDC2013T19, PTB-3, and LDC2006T06, respectively).

| Task | Dataset | Data preparation | Multi-task Result |
| --- | --- | --- | --- |
| Joint entity and relation extraction | CoNLL04 | bash run_scripts/conll04.sh | Ent. 88.4/Rel. 72.8 |
| Joint entity and relation extraction | ADE | bash run_scripts/ade.sh | Ent. 90.5/Rel. 83.6 |
| Joint entity and relation extraction | NYT | bash run_scripts/nyt.sh | Ent. 95.4/Rel. 93.7 |
| Joint entity and relation extraction | ACE2005 | bash run_scripts/ace2005_jer.sh <abs_path_to_LDC2006T06> | Ent. 90.2/Rel. 58.9 |
| Semantic role labeling | CoNLL05 WSJ | bash run_scripts/conll05_srl_wsj.sh <abs_path_to_PTB_3> | 95.5 |
| Semantic role labeling | CoNLL05 Brown | bash run_scripts/conll05_srl_brown.sh <abs_path_to_PTB_3> | 92.0 |
| Semantic role labeling | CoNLL12 | bash run_scripts/conll12_srl.sh <abs_path_to_LDC2013T19> | 97.2 |
| Event extraction | ACE2005 | bash run_scripts/ace2005event.sh <abs_path_to_LDC2006T06> | Trigger: Id-72.7/Cl-69.2; Argument: Id-67.5/Cl-63.9 |
| Intent detection | ATIS | bash run_scripts/atis.sh | 97.3 |
| Intent detection | SNIPS | bash run_scripts/snips.sh | 97.4 |
| Dialogue state tracking | MultiWOZ 2.1 | bash run_scripts/multi_woz.sh | 53.5 |
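
To reproduce a row of the table, run its data-preparation script from the repository root. For the LDC-licensed datasets, pass the absolute path of your local copy, e.g. (the path below stands in for wherever your copy lives):

# joint entity and relation extraction on ACE2005, pointing at a local LDC2006T06 copy
bash run_scripts/ace2005_jer.sh /data/LDC2006T06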

Arguments in running scripts

The arguments in src/tasks/mt/*.sh configure the training and inference of DeepStruct.

Scripts for Pretraining

Follow the commands below to prepare the pretraining data and run training.

# prepare pretraining data
bash data_scripts/PRETRAIN.sh

# run pretraining
cd ./glm/
bash scripts/ds_finetune_seq2seq_pretrain.sh config_tasks/<MODEL_TYPE>.sh config_tasks/pretrain.sh cnn_dm_original

Currently, <MODEL_TYPE> supports model_blocklm_10B_pretrain, which uses the 10-billion-parameter pretrained model as the backbone.
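
With the supported model type substituted in, the full invocation reads:

cd ./glm/
bash scripts/ds_finetune_seq2seq_pretrain.sh config_tasks/model_blocklm_10B_pretrain.sh config_tasks/pretrain.sh cnn_dm_original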

Please customize NUM_GPUS_PER_WORKER in glm/scripts/ds_finetune_seq2seq_pretrain.sh and train_micro_batch_size_per_gpu in glm/config_tasks/config.json according to your environment, as fine-tuning a 10B language model requires substantial GPU memory. Data preprocessing for pretraining may require over 600 GB of main memory, as the current dataloader implementation preloads all tokenized data into main memory.
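
For instance, on 32 GB GPUs a common setting keeps the micro batch size at 1 and scales the effective batch size through gradient accumulation. A minimal sketch of the relevant fields, assuming glm/config_tasks/config.json follows the standard DeepSpeed configuration schema (the actual file contains additional fields):

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16
}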

Citation

@inproceedings{wang-etal-2022-deepstruct,
    title = "{D}eep{S}truct: Pretraining of Language Models for Structure Prediction",
    author = "Wang, Chenguang  and
      Liu, Xiao  and
      Chen, Zui  and
      Hong, Haoyun  and
      Tang, Jie  and
      Song, Dawn",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    year = "2022",
}