GPT Struct Me

This repository hosts the code used for the development of the paper "GPT Struct Me: Probing GPT Models on Narrative Entity Extraction". The goal of this work was to evaluate the capabilities of generative large language models in extracting narrative entities from text. To accomplish that, we used the GPT-3 and ChatGPT models from OpenAI and evaluated them on the extraction of narrative entities, namely participants, events, and temporal expressions (timexs). The results obtained indicate that GPT models are competitive with out-of-the-box baseline systems.

Setup

To use this repository, you need to have access to the OpenAI models API and provide your API key in a .env file. Use the following template for the .env file:

OPENAI_API_KEY="<your_api_key>"
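
The scripts read this key at run time. Below is a minimal sketch of how it could be loaded, assuming the python-dotenv package and the legacy openai client are used; the repository's own loading code may differ:

import os
import openai                    # legacy OpenAI client (assumed, not confirmed by this repository)
from dotenv import load_dotenv   # python-dotenv

load_dotenv()                                   # reads OPENAI_API_KEY from the .env file
openai.api_key = os.environ["OPENAI_API_KEY"]   # make the key available to subsequent API calls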

Development Environment

Follow these steps to set up the development environment:

# Create and activate a virtual environment
virtualenv venv --python=python3.9
source venv/bin/activate

# Install required dependencies
pip install -r requirements.txt
pip install -e .

Data

The repository utilizes the Text2Story Lusa Corpus for experiments. To replicate the experiments, download the corpus and place it in a resources folder. The directory structure should resemble the following:

resources
└── lusa_news
    ├── lusa_0.ann
    ├── lusa_0.txt
    ├── lusa_100.ann
    ├── lusa_100.txt
    ├── lusa_101.ann
    └── ...
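
The .ann files are assumed here to follow the BRAT standoff format. The sketch below is a simplified reader for contiguous text-bound annotations, not the repository's own loader:

from pathlib import Path

def load_document(doc_id, root="resources/lusa_news"):
    """Return the raw text and a list of (label, start, end, surface) annotations."""
    text = Path(root, f"{doc_id}.txt").read_text(encoding="utf-8")
    entities = []
    for line in Path(root, f"{doc_id}.ann").read_text(encoding="utf-8").splitlines():
        if line.startswith("T"):                    # text-bound annotations (participants, events, timexs)
            _, meta, surface = line.split("\t")
            label, start, end = meta.split()[:3]    # handles contiguous spans only
            entities.append((label, int(start), int(end), surface))
    return text, entities

text, entities = load_document("lusa_0")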

Models

Launch Inference Endpoint

To deploy the models under evaluation locally, we used HuggingFace's text-generation-inference. After installing all the dependencies, run the following command:

sudo docker run \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v $PWD/resources/models/:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/falcon-40b \
  --trust-remote-code 
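
Once the container is running, the endpoint can be queried over HTTP. Below is a minimal sketch using the requests library; the prompt is purely illustrative and not one of the paper's templates:

import requests

response = requests.post(
    "http://localhost:8080/generate",   # port mapped by the docker command above
    json={
        "inputs": "List the temporal expressions in: 'The meeting took place on Tuesday.'",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
print(response.json()["generated_text"])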

Results

The following table compares the results of the out-of-the-box baselines with those of the GPT models:

Entity        System        $P$     $R$   $F_1$   $F_{1_r}$
Timexs        TEI2GO       50.7    59.4    54.7    55.7
              HeidelTime   50.1    59.3    54.2    56.3
              SRL          23.6    35.7    28.4    50.5
              GPT-3        58.5    73.5    65.1    73.0
              ChatGPT      34.5    58.6    43.5    48.0
Participants  SRL          27.1    20.7    23.4    47.7
              GPT-3        35.9    35.8    35.8    51.0
              ChatGPT      40.1    46.6    43.1    47.4
Events        SRL          67.1    40.4    50.4    83.5
              TEFE         95.4     9.1    16.6    95.6
              GPT-3        41.9    53.3    46.9    51.7
              ChatGPT      50.7    44.3    47.2    61.4

The results show that the combination of the best prompts and the GPT models outperforms the baselines (in terms of strict $F_1$) in the extraction of participants and time expressions, but fails to reach the same level in the extraction of events.
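
For reference, $P$ and $R$ denote precision and recall, and $F_1$ is their harmonic mean:

$F_{1} = \frac{2 \cdot P \cdot R}{P + R}$

$F_{1_{r}}$ is assumed here to be the relaxed-match variant, in which partially overlapping spans also count as correct, whereas strict $F_1$ requires exact span matches.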

Run Experiments

All the scripts needed to replicate the experimentation process are located in the experiments folder; the key ones are prompt_selection.py, parse.py, evaluate.py, and test.py. The steps below describe how to use them:

  1. Prompt Selection: To assess the best prompt template for a specific model (e.g., gpt4), execute the following command:

    python experiments/prompt_selection.py -m gpt4
    
  2. Parsing: After obtaining the model-generated text, parse the answers using the parse.py script:

    python experiments/parse.py --mode prompt_selection
    
  3. Evaluation: To compare predictions with annotations, run the evaluate.py script:

    python experiments/evaluate.py --mode prompt_selection
    
  4. Template Selection: Based on evaluation results, identify the best template and add it to the constants.py file in the experiments folder.

  5. Final Experiment: Run the test.py script to execute the final experiment.

    python experiments/test.py -m gpt4
    

These scripts should be executed sequentially, in the order presented. You can pass the --help flag to any script to view the supported options, including the list of supported models.