PEARL

This is the repository for our paper PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents.

Setup

Please make sure the openai package is installed and that your API key has been exported to the environment variable OPENAI_API_KEY_OAI.
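
A quick way to verify the setup is a small sanity check; this is a hypothetical snippet (not part of this repo), assuming the pre-1.0 openai SDK:

# Hypothetical sanity check: verify the key is set and the pre-1.0
# openai SDK can issue a request.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY_OAI"]  # repo-specific variable name

resp = openai.ChatCompletion.create(
    model="gpt-4-0314",
    messages=[{"role": "user", "content": "ping"}],
    temperature=0,
    max_tokens=1,
)
print(resp["choices"][0]["message"]["content"])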

Preprocess Data

  1. Download the QuALITY data and unzip it to the ./data/raw folder.
  2. Run python data_preproc.py. This step produces two files in the data/processed folder: quality_dev_q.csv and quality_train_q.csv (see the inspection sketch below).
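
To sanity-check the preprocessing, the resulting CSVs can be inspected with pandas; a hypothetical snippet (the column names are printed rather than assumed, since the README does not specify them):

# Hypothetical inspection snippet: peek at the processed questions.
import pandas as pd

df = pd.read_csv("data/processed/quality_dev_q.csv")
print(df.shape)
print(df.columns.tolist())  # column names are not documented here, so print them
print(df.head())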

Action Mining

PEARL mines actions from data with a similar distribution (in this repo, the training data of the QuALITY dataset) instead of assuming a pre-defined action space. To mine actions from the training set, run

bash ./script.sh action_mining

The above command produces the file ./output/mined_actions_init.txt, which stores the actions in the following format:

ANALYZE(CTX, X, Y) #Analyze the relationship, attitude, or feelings between X and Y, or the character, language, tone, or symbolism of X given the input CTX.

Note that the generation process is not entirely deterministic even after setting both temperature and top_p to 0. We provide examples of mined actions in output/mined_actions_init_example.txt.
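
For reference, a line in this format splits into a signature and a description at the # character; a minimal parsing sketch (hypothetical, not part of the repo):

# Hypothetical parser for lines in mined_actions_init.txt, assuming the
# "SIGNATURE #description" format shown above.
import re

def parse_action(line: str):
    sig, _, desc = line.partition("#")
    m = re.match(r"(\w+)\((.*)\)", sig.strip())
    if m is None:
        return None
    name = m.group(1)
    args = [a.strip() for a in m.group(2).split(",")]
    return {"name": name, "args": args, "description": desc.strip()}

print(parse_action("ANALYZE(CTX, X, Y) #Analyze the relationship ..."))
# {'name': 'ANALYZE', 'args': ['CTX', 'X', 'Y'], 'description': 'Analyze the relationship ...'}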

In our experiments, the total length of all mined actions exceeds the maximum context length of GPT-4-8k, so we add a step that simplifies the mined actions:

bash ./script.sh action_simplification

Example actions simplified from output/mined_actions_init_example.txt are provided in output/mined_actions_simplified_example.txt. The number of actions can be adjusted by running multiple rounds of action simplification. More details are in Section 4.1 of our paper.
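
To check whether a set of mined actions fits in the GPT-4-8k context, one can count tokens with tiktoken; a hypothetical snippet, not part of the repo:

# Hypothetical length check: count tokens of the mined action list
# to see whether it fits in a GPT-4-8k prompt.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
with open("output/mined_actions_init_example.txt") as f:
    n_tokens = len(enc.encode(f.read()))
print(f"{n_tokens} tokens (GPT-4-8k context: 8192 tokens)")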

Plan Generation and Execution with PEARL

We evaluate PEARL on a subset of QuALITY questions annotated as requiring long context to answer. For both the baselines and PEARL, the output is stored in the ./output folder following the format {prompt_type}_out.{split}.{ctx_type}.csv, where {split} is the original QuALITY split (train or dev) from which the example is drawn, and {ctx_type} denotes the context size required to answer the question.
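
For example, a helper mirroring this naming scheme might look like the following (hypothetical; the repo constructs these paths internally):

# Hypothetical helper mirroring the output naming scheme above.
def output_path(prompt_type: str, split: str, ctx_type: str) -> str:
    return f"./output/{prompt_type}_out.{split}.{ctx_type}.csv"

print(output_path("pearl", "dev", "ctx_eval_long"))
# ./output/pearl_out.dev.ctx_eval_long.csv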

To run the multiple-choice question baseline, run

bash ./script.sh baseline_mcq

We provide output in ./output/baseline_mcq_out.{split}.{ctx_type}.csv

To run the free-form answer baseline, run

bash ./script.sh baseline_gqa

We provide output in ./output/baseline_gqa_out.{split}.{ctx_type}.csv

To run PEARL on the challenge subset of QuALITY, run

bash ./script.sh pearl

For PEARL, two files are generated:

  1. A .csv file that contains the plan, the answer, and the mapped answer.
  2. A .pkl file that stores the intermediate output, where the keys are the output variables in the plan and the values are the executed results assigned to those variables.

We provide example .csv output of two runs with the gpt-4-0314 checkpoint in ./output/pearl_out.{split}.{ctx_type}.csv, as well as the intermediate output of one run in a .pkl file.
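
The intermediate .pkl output can be inspected with pickle; a hypothetical snippet (the file name below is an assumption, as the README does not specify it):

# Hypothetical inspection of the intermediate .pkl output.
import pickle

with open("output/pearl_out.dev.ctx_eval_long.pkl", "rb") as f:  # assumed file name
    intermediate = pickle.load(f)

# Keys are output variables from the plan; values are executed results.
for var, result in intermediate.items():
    print(var, "->", str(result)[:80])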

To see the intermediate step output, run the command in ./script.sh with --debug. Example output for executing one action is shown below: the parsed action followed by the executed output.

{'action': 'FIND_RELATION', 'args': ['CTX', '"Ro"', '"mother"'], 'output_var': 'ro_mother', 'detailed_action': 'Find and summarize the relationship between Ro and his mother in the input article'}
In the input article, Ro is a young Martian who has returned to his home ... The relationship between Ro and his mother seems to be one of respect and learning, as he remembers her words and uses them to navigate the challenges he faces.
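
A parsed action dict like the one above can be rendered back into a readable plan step; a hypothetical sketch:

# Hypothetical pretty-printer for a parsed action like the example above.
def format_action(action: dict) -> str:
    call = f"{action['action']}({', '.join(action['args'])})"
    return f"{action['output_var']} = {call}  # {action['detailed_action']}"

parsed = {'action': 'FIND_RELATION', 'args': ['CTX', '"Ro"', '"mother"'],
          'output_var': 'ro_mother',
          'detailed_action': 'Find and summarize the relationship between Ro and his mother in the input article'}
print(format_action(parsed))
# ro_mother = FIND_RELATION(CTX, "Ro", "mother")  # Find and summarize the relationship ...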

Note that the code currently uses the examples provided in prompt_bank for plan generation. To generate demonstrations with GPT-4 along with self-refinement, run

bash ./script.sh refine

The generated demonstrations will be printed out and can later be incorporated into prompt_bank/plan_gen.txt.

To compute the mapped answer accuracy for each method, run

python comp_acc.py baseline_mcq_out
# File: ./output/baseline_mcq_out.dev.ctx_eval_long.csv, accuracy: 81.2
# File: ./output/baseline_mcq_out.dev.ctx_eval_short.csv, accuracy: 84.4
# File: ./output/baseline_mcq_out.train.ctx_eval_long.csv, accuracy: 71.7
# Total accuracy: 78.7

python comp_acc.py baseline_gqa_out
# File: ./output/baseline_gqa_out.dev.ctx_eval_long.csv, accuracy: 71.5
# File: ./output/baseline_gqa_out.dev.ctx_eval_short.csv, accuracy: 79.1
# File: ./output/baseline_gqa_out.train.ctx_eval_long.csv, accuracy: 57.9
# Total accuracy: 68.8

python comp_acc.py pearl_out
# File: ./output/pearl_out.dev.ctx_eval_long.csv, accuracy: 77.4
# File: ./output/pearl_out.dev.ctx_eval_short.csv, accuracy: 76.7
# File: ./output/pearl_out.train.ctx_eval_long.csv, accuracy: 63.8
# Total accuracy: 72.2
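
For reference, a minimal sketch of such a mapped-answer accuracy computation (hypothetical; comp_acc.py is the actual implementation, and the column names below are assumptions):

# Hypothetical accuracy computation over the output CSVs. The column names
# "mapped_answer" and "gold_label" are assumptions, not the repo's schema.
import glob
import sys
import pandas as pd

total, correct = 0, 0
for path in sorted(glob.glob(f"./output/{sys.argv[1]}.*.csv")):
    df = pd.read_csv(path)
    hits = (df["mapped_answer"] == df["gold_label"]).sum()
    print(f"File: {path}, accuracy: {100 * hits / len(df):.1f}")
    total += len(df)
    correct += hits
print(f"Total accuracy: {100 * correct / total:.1f}")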

Cite

@misc{sun2023pearl,
      title={PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents}, 
      author={Simeng Sun and Yang Liu and Shuohang Wang and Chenguang Zhu and Mohit Iyyer},
      year={2023},
      eprint={2305.14564},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}