# GPT Struct Me
This repository hosts the code used for the development of the "GPT Struct Me: Probing GPT Models on Narrative Entity Extraction" paper. The goal of this work was to evaluate the capabilities of generative large language models on the extraction of narrative entities from text. To accomplish that, we used the GPT-3 and ChatGPT models from OpenAI and evaluated them on the extraction of narrative entities, namely participants, events, and temporal expressions (timexs). The results indicate that GPT models are competitive with out-of-the-box baseline systems.
## Setup
To use this repository, you need access to the OpenAI models API and must provide your API key in a `.env` file. Use the following template for the `.env` file:

```text
OPENAI_API_KEY="<your_api_key>"
```
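To check that the key is picked up before launching any experiment, a minimal sketch along the following lines should work. It assumes the `python-dotenv` package is available; this is only an illustration, not necessarily how the repository itself loads the key:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Read the variables defined in the .env file into the process environment.
load_dotenv()

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file.")
print(f"OpenAI key loaded ({len(api_key)} characters).")
```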
### Development Environment
Follow these steps to set up the development environment:

```bash
# Create and activate a virtual environment
virtualenv venv --python=python3.9
source venv/bin/activate

# Install the required dependencies
pip install -r requirements.txt
pip install -e .
```
## Data
The experiments use the Text2Story Lusa Corpus. To replicate them, download the corpus and place it in a `resources` folder. The directory structure should resemble the following:

```text
resources
└── lusa_news
    ├── lusa_0.ann
    ├── lusa_0.txt
    ├── lusa_100.ann
    ├── lusa_100.txt
    ├── lusa_101.ann
    └── ...
```
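Each document comes as a `.txt` file with the raw text and a `.ann` file with the annotations. As a rough illustration, assuming the `.ann` files follow the BRAT standoff format (adjust if the corpus release differs), a document and its text-bound annotations could be loaded like this:

```python
from pathlib import Path


def load_document(doc_id: str, root: str = "resources/lusa_news"):
    """Load the raw text and the standoff annotations of one Lusa document."""
    base = Path(root) / doc_id
    text = base.with_suffix(".txt").read_text(encoding="utf-8")

    entities = []
    for line in base.with_suffix(".ann").read_text(encoding="utf-8").splitlines():
        # Text-bound annotations start with "T", e.g. "T1<TAB>Time 0 10<TAB>29 de maio".
        if not line.startswith("T"):
            continue
        tag_id, tag_info, surface = line.split("\t", 2)
        label = tag_info.split(" ")[0]  # e.g. Participant, Event, Time
        entities.append((tag_id, label, surface))
    return text, entities


text, entities = load_document("lusa_0")
print(text[:80])
print(entities[:5])
```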
## Models
- GPT: For the GPT models we use the OpenAI Python package and API.
- Falcon: The Falcon models have to be cloned to the `resources/models` directory:
  - `git clone https://huggingface.co/tiiuae/falcon-7b-instruct`
  - `git clone https://huggingface.co/tiiuae/falcon-7b`
  - `git clone https://huggingface.co/tiiuae/falcon-180b`
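As a rough illustration of how a prompt can be sent through the OpenAI Python package (not the exact code or prompt templates used in the paper), the sketch below assumes the pre-1.0 `openai` interface; newer versions of the package use a client object instead:

```python
import os

import openai  # sketch assumes the pre-1.0 openai package interface
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical prompt; the actual templates live in the experiments code.
prompt = "List the temporal expressions in the following sentence: ..."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response["choices"][0]["message"]["content"])
```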
### Launch inference endpoint
To deploy the models under evaluation locally, we used Hugging Face's `text-generation-inference`. After installing all the dependencies, run the following command:

```bash
sudo docker run \
    --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v $PWD/resources/models/:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data/falcon-40b \
    --trust-remote-code
```
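Once the container is up, the endpoint can be queried over HTTP. The sketch below assumes the default `/generate` route exposed by text-generation-inference on the port mapped above (8080); the prompt is a placeholder:

```python
import requests

# The server was started with "-p 8080:80", so it listens on localhost:8080.
payload = {
    "inputs": "List the temporal expressions in the following sentence: ...",
    "parameters": {"max_new_tokens": 128, "temperature": 0.1},
}

response = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["generated_text"])
```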
## Results
The following table compares the results obtained from out-of-the-box baselines with the GPT models:
| Entity | System | $P$ | $R$ | $F_{1}$ | $F_{1_{r}}$ |
|---|---|---|---|---|---|
| Timexs | TEI2GO | 50.7 | 59.4 | 54.7 | 55.7 |
| Timexs | HeidelTime | 50.1 | 59.3 | 54.2 | 56.3 |
| Timexs | SRL | 23.6 | 35.7 | 28.4 | 50.5 |
| Timexs | GPT-3 | 58.5 | 73.5 | 65.1 | 73.0 |
| Timexs | ChatGPT | 34.5 | 58.6 | 43.5 | 48.0 |
| Participants | SRL | 27.1 | 20.7 | 23.4 | 47.7 |
| Participants | GPT-3 | 35.9 | 35.8 | 35.8 | 51.0 |
| Participants | ChatGPT | 40.1 | 46.6 | 43.1 | 47.4 |
| Events | SRL | 67.1 | 40.4 | 50.4 | 83.5 |
| Events | TEFE | 95.4 | 9.1 | 16.6 | 95.6 |
| Events | GPT-3 | 41.9 | 53.3 | 46.9 | 51.7 |
| Events | ChatGPT | 50.7 | 44.3 | 47.2 | 61.4 |
The results show that the combination of the best prompts and GPT models outperforms the baselines (using strict $F_1$) in the extraction of participants and time expressions, but fails to reach the same results in the extraction of events.
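For intuition on the difference between the two $F_1$ columns, the sketch below shows one common way strict and relaxed span matching are computed; this is an illustrative simplification, and the exact relaxed criterion used in the paper may differ:

```python
def strict_match(pred, gold):
    """A prediction counts only if its character span equals the gold span exactly."""
    return pred == gold


def relaxed_match(pred, gold):
    """One common relaxation: any overlap between predicted and gold spans counts."""
    (p_start, p_end), (g_start, g_end) = pred, gold
    return p_start < g_end and g_start < p_end


def f1(preds, golds, match):
    """Span-level F1 given a matching predicate (illustrative only)."""
    tp_pred = sum(any(match(p, g) for g in golds) for p in preds)
    tp_gold = sum(any(match(p, g) for p in preds) for g in golds)
    precision = tp_pred / len(preds) if preds else 0.0
    recall = tp_gold / len(golds) if golds else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


golds = [(0, 4), (10, 18)]
preds = [(0, 4), (11, 18), (30, 35)]
print(f1(preds, golds, strict_match))   # strict F1
print(f1(preds, golds, relaxed_match))  # relaxed F1
```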
## Run Experiments

All the scripts needed to replicate the experimentation process are in the `experiments` folder. The workflow consists of the following steps:

- Prompt Selection: `prompt_selection.py` is used to assess the best prompt template for a given model. For example, to select the best prompt template for the `gpt4` model, run `python experiments/prompt_selection.py -m gpt4`.
- Parsing: after the model has produced its text, the answers have to be parsed. This is done by running the `parse.py` script with `python experiments/parse.py --mode prompt_selection`, which produces a `predictions.json` file in the `results/prompt_selection` folder.
- Evaluation: with the predictions in hand, they can be compared against the annotations. Running `python experiments/evaluate.py --mode prompt_selection` produces a `results.json` file in the `results/prompt_selection` folder containing the precision, recall, $F_1$ score, and relaxed $F_1$ score.
- Template Selection: from the values reported by `evaluate.py`, one can determine the best template according to a given criterion (for our research, we use the strict $F_1$ score). After identifying the best template, add it for the corresponding model-entity pair to `constants.py` in the `experiments` folder.
- Final Experiment: with that done, run the `experiments/test.py` script, e.g., `python experiments/test.py -m gpt4`.

These scripts are meant to be executed sequentially and in the order presented: the best template is selected with `prompt_selection.py`, and only after that should you run `test.py`. To see the list of supported models and options, pass the `--help` flag to any of the scripts. An end-to-end sketch of this workflow follows.
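The sketch below simply chains the commands documented above via `subprocess`; it is a convenience wrapper, not part of the repository, and the manual edit of `constants.py` between the evaluation and test steps remains the user's responsibility:

```python
import subprocess
import sys

MODEL = "gpt4"  # any of the supported model identifiers (see --help)


def run(*args: str) -> None:
    """Run one experiment script and stop the pipeline if it fails."""
    print("+", " ".join(args))
    subprocess.run([sys.executable, *args], check=True)


# 1. Select the best prompt template for the model.
run("experiments/prompt_selection.py", "-m", MODEL)
# 2. Parse the generated answers into predictions.json.
run("experiments/parse.py", "--mode", "prompt_selection")
# 3. Evaluate the predictions against the annotations (writes results.json).
run("experiments/evaluate.py", "--mode", "prompt_selection")

# 4. Inspect results/prompt_selection/results.json, pick the best template,
#    and register it in experiments/constants.py before the final run.
input("Update experiments/constants.py with the best template, then press Enter...")

# 5. Run the final experiment with the selected template.
run("experiments/test.py", "-m", MODEL)
```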