UIE
- Code for Unified Structure Generation for Universal Information Extraction
- Please contact Yaojie Lu (@luyaojie) for questions and suggestions.
Update
- [2022-06-12] Update pre-training code.
- [2022-05-10] Update data preprocessing code.
Requirements
General
- Python (verified on 3.8)
- CUDA (verified on 11.1/10.2)
Python Packages
CUDA 10.2
conda create -n uie python=3.8
conda install -y pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
CUDA 11.1
conda create -n uie python=3.8
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
Quick Start
Datasets of Extraction Tasks
For details on preprocessing, see Data preprocessing.
After that, please link the preprocessed dataset as:
ln -s dataset_processing/converted_data/ data
Data Format
Data folder contains seven files:
data/text2spotasoc/absa/14lap
├── entity.schema # Entity Types for converting SEL to Record
├── relation.schema # Relation Types for converting SEL to Record
├── event.schema # Event Types for converting SEL to Record
├── record.schema # Spot/Asoc Type for constructing SSI
├── test.json
├── train.json
└── val.json
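A quick sanity check of this layout can be sketched in Python. The helper name `missing_dataset_files` is hypothetical and not part of the repository; the file names are taken from the listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "entity.schema", "relation.schema", "event.schema", "record.schema",
    "train.json", "val.json", "test.json",
)


def missing_dataset_files(folder):
    """Return the expected file names that are absent from `folder`."""
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Example path from the listing above; adjust it to your dataset.
    print(missing_dataset_files("data/text2spotasoc/absa/14lap"))
```

An empty list means all seven files are present.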
train.json, val.json, and test.json are the data files; each line is one JSON instance.
Each JSON instance contains text and record fields, where text is the plain text and record is the SEL representation of the extraction structure.
For detailed definitions, see DATASETS.md.
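As a minimal sketch, the data files can be read line by line with the standard json module. The function name `read_instances` is hypothetical; only the text and record field names come from the description above, and any other fields an instance may carry are ignored here.

```python
import json


def read_instances(path):
    """Yield (text, record) pairs from a JSON-lines file such as train.json."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            instance = json.loads(line)
            yield instance["text"], instance["record"]
```

For example, `next(read_instances("data/text2spotasoc/absa/14lap/train.json"))` would return the first training pair, assuming the dataset has been linked as described above.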
Note:
- The extra tokens of T5, such as <extra_id_0>, <extra_id_1>, and <extra_id_5>, are used as structure indicators.
Token | Role |
---|---|
<extra_id_0> | Start of Label Name |
<extra_id_1> | End of Label Name |
<extra_id_2> | Start of Input Text |
<extra_id_5> | Start of Text Span |
<extra_id_6> | NULL span for Rejection |
- record.schema is the record schema file used to build the SSI. It contains three lines: the first is the spot name list, the second is the asoc name list, and the third is the spot-to-asoc dictionary (not used in the code; it can be ignored).
["aspect", "opinion"]
["neutral", "positive", "negative"]
{"aspect": ["neutral", "positive", "negative"], "opinion": []}
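Since each of the three lines is valid JSON, the schema can be parsed as sketched below. The function name `parse_record_schema` is hypothetical and is not the repository's own loader; the sample lines are the example shown above.

```python
import json


def parse_record_schema(lines):
    """Parse the three lines of record.schema into Python objects."""
    spots, asocs, spot_to_asoc = (json.loads(line) for line in lines[:3])
    return spots, asocs, spot_to_asoc


example = [
    '["aspect", "opinion"]',
    '["neutral", "positive", "negative"]',
    '{"aspect": ["neutral", "positive", "negative"], "opinion": []}',
]
spots, asocs, spot_to_asoc = parse_record_schema(example)
print(spots)  # ['aspect', 'opinion']
print(asocs)  # ['neutral', 'positive', 'negative']
```

To parse a real file, pass `open(path, encoding="utf-8").read().splitlines()` as `lines`.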
Pretrained Models
You can find the pre-trained models at the following CAS Cloud Box/Google Drive links, or download them with the gdown command (pip install gdown).
uie-en-base [CAS Cloud Box] [Google Drive] [Huggingface]
uie-en-large [CAS Cloud Box] [Google Drive] [Huggingface]
uie-char-small (chinese) [CAS Cloud Box]
# Example of Google Drive
gdown 12Dkh6KLDPvXrkQ1I-1xLqODQSYjkwnvs && unzip uie-base-en.zip
gdown 15OFkWw8kJA1k2g_zehZ0pxcjTABY2iF1 && unzip uie-large-en.zip
Put all models in hf_models/ for the default running scripts.
Model Fine-tuning
First, create the output directory.
The training scripts are as follows:
- run_uie_finetune.py: Python code entry.
- run_uie_finetune.bash: Model training and evaluation process script.
- scripts_exp/run_exp.bash: Model environment configuration and parameter setting entry.
The command for the training is as follows (see bash scripts and Python files for the corresponding command-line arguments):
. config/data_conf/base_model_conf_absa.ini && model_name=uie-base-en dataset_name=absa/14lap bash scripts_exp/run_exp.bash
- config/data_conf/base_model_conf_absa.ini refers to using the training settings in base_model_conf_absa.ini.
- model_name=uie-base-en refers to using uie-base-en.
- dataset_name=absa/14lap refers to the dataset path.
Trained models are saved in the output_dir specified by run_uie_finetune.bash.
Simple Training Command
bash run_uie_finetune.bash -v -d 0 \
-b 16 \
-k 3 \
--lr 1e-4 \
--warmup_ratio 0.06 \
-i absa/14lap \
--epoch 50 \
--spot_noise 0.1 \
--asoc_noise 0.1 \
-f spotasoc \
--map_config config/offset_map/closest_offset_en.yaml \
-m hf_models/uie-base-en \
--random_prompt
Progress logs
...
***** Running training *****
Num examples = 906
Num Epochs = 50
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 2850
Num examples = 219
Batch size = 64
...
Final Result (specific scores may differ across machines and environments)
...
test offset-rel-strict-P 67.01461377870564
test offset-rel-strict-R 59.11602209944752
test offset-rel-strict-F1 62.81800391389433
...
Metric | Definition |
---|---|
ent-(P/R/F1) | Micro-F1 of Entity (Entity Type, Entity Span) |
rel-strict-(P/R/F1) | Micro-F1 of Relation Strict (Relation Type, Arg1 Span, Arg1 Type, Arg2 Span, Arg2 Type) |
rel-boundary-(P/R/F1) | Micro-F1 of Relation Boundary (Relation Type, Arg1 Span, Arg2 Span) |
evt-trigger-(P/R/F1) | Micro-F1 of Event Trigger (Event Type, Trigger Span) |
evt-role-(P/R/F1) | Micro-F1 of Event Argument (Event Type, Arg Role, Arg Span) |
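The micro-averaged scores in the table can be sketched as follows. This is an illustration, not the repository's scorer: the function name `micro_prf` is hypothetical, and it assumes each instance's gold and predicted items are compared as sets of tuples, so duplicates collapse; the actual evaluation script may treat duplicates differently.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1, in percent.

    `gold` and `pred` are parallel lists holding one collection of tuples per
    instance, e.g. (entity_type, entity_span) for the ent-* metrics.
    """
    tp = gold_total = pred_total = 0
    for gold_items, pred_items in zip(gold, pred):
        gold_set, pred_set = set(gold_items), set(pred_items)
        tp += len(gold_set & pred_set)  # exact tuple matches
        gold_total += len(gold_set)
        pred_total += len(pred_set)
    p = 100.0 * tp / pred_total if pred_total else 0.0
    r = 100.0 * tp / gold_total if gold_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

The final result shown earlier is consistent with this definition of F1: with P ≈ 67.01 and R ≈ 59.12, 2PR/(P+R) ≈ 62.82.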
Model Pre-training
[TODO] Add detailed description.
Data Collator
We construct different sequence-to-sequence tasks using different data collators.
- For pre-training, HybirdDataCollator constructs different seq2seq pairs for different tasks, and DataCollatorForMetaSeq2Seq constructs the SSI with the Sampling Strategy.
- For fine-tuning, DataCollatorForMetaSeq2Seq constructs the dynamic seq2seq pair with the Rejection Mechanism.
HybirdDataCollator
We unify different types of (text, structure) pairs for pre-training with HybirdDataCollator. It contains multiple data collators for different instances:
- DataCollatorForMetaSeq2Seq for the pair task, similar to the fine-tuning stage
- DataCollatorForSeq2Seq for the record task
- DataCollatorForT5MLM for the text task
DataCollatorForMetaSeq2Seq
The Sampling Strategy and the Rejection Mechanism can be adopted in the training process.
- uie/seq2seq/data_collator/meta_data_collator.py: class DataCollatorForMetaSeq2Seq collates data, and class DynamicSSIGenerator performs prompt sampling.
- run_uie_finetune.py: class DataTrainingArguments contains the related parameters.
Related parameters in class DataTrainingArguments are briefly introduced here:
- About Sampling Strategy
  - max_prefix_length: Maximum length of SSI
  - ordered_prompt: Whether to sort the spot/asoc of SSI or not
  - record_schema: Record schema read from record.schema
- About Rejection Mechanism
  - spot_noise: The noise rate of null spot
  - asoc_noise: The noise rate of null asoc
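To give an intuition for the noise rates, here is a rough sketch of one plausible reading of the Rejection Mechanism: spot types that do not occur in an instance are paired with the NULL span at the given rate. The function `sample_null_spots` and its per-label sampling are assumptions made for illustration; the actual logic lives in DynamicSSIGenerator and DataCollatorForMetaSeq2Seq and may differ.

```python
import random

NULL_SPAN = "<extra_id_6>"  # "NULL span for Rejection" in the token table above


def sample_null_spots(positive_spots, all_spots, noise_rate, rng=None):
    """Pair each absent spot type with the NULL span with probability noise_rate.

    Hypothetical helper: an illustration of a "noise rate of null spot",
    not the repository's implementation.
    """
    rng = rng or random.Random()
    positives = set(positive_spots)
    negatives = [spot for spot in all_spots if spot not in positives]
    return [(spot, NULL_SPAN) for spot in negatives if rng.random() < noise_rate]
```

With a rate of 0.1, as in the simple training command above, roughly one in ten absent spot types would be added as a NULL target under this reading. The same idea would apply to asoc types with asoc_noise.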
Scripts for Model Evaluation
Verifying the performance of UIE requires converting the generated SEL expressions into Records and then evaluating them.
1. Convert structured expressions to record structures (sel2record.py)
After training, pred_folder will contain 'eval_preds_seq2seq.txt' or 'test_preds_seq2seq.txt'.
$ python scripts/sel2record.py -h
usage: sel2record.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-c MAP_CONFIG] [-d DECODING] [-v]
optional arguments:
-h, --help show this help message and exit
-g GOLD_FOLDER folder of golden answer
-p PRED_FOLDER [PRED_FOLDER ...]
multiple different prediction folders
-c MAP_CONFIG, --config MAP_CONFIG
offset matching strategy configuration file, more configuration files are placed in config/offset_map
-d DECODING specify structure parser, default is SpotAsoc structure
-v, --verbose print more detailed log information
2. Validate model performance (eval_extraction.py)
After converting, pred_folder will contain 'eval_preds_record.txt' or 'test_preds_record.txt'.
$ python scripts/eval_extraction.py -h
usage: eval_extraction.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-v] [-w] [-m] [-case]
optional arguments:
-h, --help show this help message and exit
-g GOLD_FOLDER Golden Dataset folder
-p PRED_FOLDER [PRED_FOLDER ...]
Predicted model folder
-v Show more information during running
-w Write evaluation results to predicted folder
-m Refers to the matching policy
-case Show case study
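Steps 1 and 2 can be chained from Python, as sketched below. The helper names `build_eval_commands` and `run_eval` are hypothetical; the script paths and the -g, -p, -c, and -w flags are taken from the help output above, and the folder arguments are whatever gold, prediction, and map-config paths you use.

```python
import subprocess


def build_eval_commands(gold_folder, pred_folder, map_config):
    """Return the two command lines: SEL-to-record conversion, then scoring."""
    convert = ["python", "scripts/sel2record.py",
               "-g", gold_folder, "-p", pred_folder, "-c", map_config]
    evaluate = ["python", "scripts/eval_extraction.py",
                "-g", gold_folder, "-p", pred_folder, "-w"]
    return [convert, evaluate]


def run_eval(gold_folder, pred_folder, map_config):
    """Run both steps in order, stopping if either one fails."""
    for cmd in build_eval_commands(gold_folder, pred_folder, map_config):
        subprocess.run(cmd, check=True)
```

For the ABSA example, the gold folder would be data/text2spotasoc/absa/14lap and the map config config/offset_map/closest_offset_en.yaml; the prediction folder is the output directory of your training run.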
3. Verify the performance of the mapping label (check_offset_map_gold_as_pred.bash)
To verify the effect of the structure parser, we take the golden answer SEL as the prediction result and evaluate its performance.
bash scripts/check_offset_map_gold_as_pred.bash <data-folder> <map-config>
Citation
If this repository helps you, please cite this paper:
Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, Hua Wu. Unified Structure Generation for Universal Information Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5755–5772, Dublin, Ireland. Association for Computational Linguistics.
@inproceedings{lu-etal-2022-unified,
title = "Unified Structure Generation for Universal Information Extraction",
author = "Lu, Yaojie and
Liu, Qing and
Dai, Dai and
Xiao, Xinyan and
Lin, Hongyu and
Han, Xianpei and
Sun, Le and
Wu, Hua",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.395",
pages = "5755--5772",
}
License
The code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for Noncommercial use only. Any commercial use should get formal permission first.