LLM-SAP: Large Language Model Situational Awareness Based Planning
This is the official open-source repository for the paper LLM-SAP: Large Language Model Situational Awareness Based Planning.
The paper is available on arXiv.
<div align="center"> <img src="img/image1.png"> </div>

Web Page Navigation
- LLM-SAP: Large Language Model Situational Awareness Based Planning
- Dataset Introduction
- Quick Start
- Citation
Dataset Introduction
This dataset mainly consists of hazardous scenarios drawn from 24 home scenes.
It includes the list of scenes, the scenario events, the planning complexity, detailed scenario descriptions, corresponding image descriptions, the best human-written finite state machine (FSM) demonstrations, and approximate visualization images of the FSMs.
The detailed scene content can be viewed and used via the CSV or JSON files under the dataset folder.
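As a quick orientation, the snippet below is a minimal sketch of loading the scene files in Python. The file paths and column layout are assumptions for illustration; check the dataset folder for the actual names.

```python
# Minimal sketch (assumed paths): load the scenario table from the dataset folder.
import json
import pandas as pd

# CSV view of the scenarios (the file name is an assumption)
scenes = pd.read_csv("dataset/24_Home_Hazard_Scenario.csv")
print(scenes.head())

# Equivalent JSON view (the file name is an assumption)
with open("dataset/24_Home_Hazard_Scenario.json", "r", encoding="utf-8") as f:
    scenes_json = json.load(f)
print(len(scenes_json), "scenarios loaded")
```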
In addition, we provide around 600 high-quality multimodal situational awareness planning samples generated with the SAP prompting method described in this paper; they can be used for fine-tuning and are available on GitHub or Hugging Face.
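If the fine-tuning corpus is consumed from Hugging Face, loading could look like the sketch below. The repository id shown is a placeholder, not the real dataset name; replace it with the id behind the link above.

```python
# Hypothetical sketch: pull the ~600 SAP fine-tuning samples from Hugging Face.
# "your-org/llm-sap-household-safety" is a placeholder id, not the real dataset name.
from datasets import load_dataset

sap_train = load_dataset("your-org/llm-sap-household-safety", split="train")
print(sap_train[0])
```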
Quick Start
<div align="center"> <img src="img/image2.png" width="400px"> </div>

The dataset folder contains the main dataset content, including the test set 24_Home_Hazard_Scenario and the training set Household_Safety.
The finite state machines generated by the tests described in the paper, together with their evaluation results, are stored in the generated_FSM and eval_result folders.
Both folders are explained in detail below.
Generated Results
Generation templates (a usage sketch follows this list)
- baseline prompting template
- SAP prompting template
- Comparison eval prompting template
- Single round eval without feedback prompting template
- Second round eval with feedback prompting template
- Feedback prompting template
- Ablation study 1 normal prompting template
- Ablation study 1 SAP prompting template
- Ablation study 2 Zero_shot_COT prompting template
- Ablation study 2 EP05 prompting template
- Ablation study 2 EP09 prompting template
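To make the role of these templates concrete, here is a hedged sketch of filling a prompting template with a scenario description and querying a chat model through the OpenAI Python client. The template path, the "{scenario}" placeholder, and the model choice are all assumptions for illustration, not the repository's actual interface.

```python
# Hedged sketch: fill a prompting template with a scenario and query a chat model.
# The template path and the "{scenario}" placeholder are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("prompt_templates/sap_template.txt", "r", encoding="utf-8") as f:
    template = f.read()

scenario = "A pot is boiling over on an unattended stove while a child plays nearby."
prompt = template.format(scenario=scenario)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected to contain the generated FSM
```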
GPT-4 generated results
- GPT-4+SAP
- GPT-4
GPT-3.5 generated results
- GPT-3.5+SAP (the demo is shown here)
- GPT-3.5
Claude-2 generated results
- Claude-2+SAP
- Claude-2
Multi-agent test
- GPT-3.5+SAP: GPT-3.5+SAP was selected as the test baseline.
- Claude-2 eval: GPT-3.5+SAP is evaluated against the best demo in the corresponding scene, and feedback is generated.
- Regenerate with feedback: new FSM results are generated by GPT-3.5 according to the feedback.
- Claude-2 eval new FSM: GPT-3.5+SAP+feedback is evaluated with GPT-3.5 based on the best demo, or evaluated with GPT-4+SAP based on the GPT-3.5+SAP result.
(The whole demo is shown here; a sketch of this loop follows below.)
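The snippet below is a minimal, hedged sketch of that generate, evaluate, regenerate loop. The helper callables (`call_gpt35`, `call_claude2`) and the prompt strings are placeholders standing in for the actual prompting code, not functions provided by this repository.

```python
# Hedged sketch of the multi-agent loop: GPT-3.5+SAP generates an FSM, the
# evaluator scores it against the best human demo and returns feedback, then
# GPT-3.5 regenerates with that feedback and the new FSM is evaluated again.
def multi_agent_sap(scenario: str, best_demo: str, call_gpt35, call_claude2) -> dict:
    # Round 1: baseline generation with the SAP prompt
    fsm_v1 = call_gpt35(f"SAP prompt for scenario:\n{scenario}")

    # Round 1 evaluation: compare against the best human-written demo and request feedback
    feedback = call_claude2(
        f"Best demo:\n{best_demo}\n\nCandidate FSM:\n{fsm_v1}\n\nScore it and give feedback."
    )

    # Round 2: regenerate the FSM using the evaluator's feedback
    fsm_v2 = call_gpt35(
        f"SAP prompt for scenario:\n{scenario}\n\nRevise this FSM:\n{fsm_v1}\n\nFeedback:\n{feedback}"
    )

    # Round 2 evaluation of the revised FSM
    final_eval = call_claude2(
        f"Best demo:\n{best_demo}\n\nRevised FSM:\n{fsm_v2}\n\nScore it."
    )
    return {"fsm_v1": fsm_v1, "feedback": feedback, "fsm_v2": fsm_v2, "final_eval": final_eval}
```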
Ablation test 1: formatting generation
The format prompt is shown in the prompt template.
The evaluated ablation generations come from GPT-4 only; the Vicuna13b generations below are provided without evaluation.
- GPT-4 format with SAP: generation result here
- GPT-4 format without SAP: generation result here
- Vicuna13b format with SAP: only the generation from Vicuna is provided, without evaluation. Generation result here
- Vicuna13b format without SAP: only the generation from Vicuna is provided, without evaluation. Generation result here
Ablation test 2 generation results
The generation results of ablation study 2, evaluated only by GPT-4. The three prompt variants are also collected in a short sketch after this subsection.
GPT-4+Zero_shot_COT Generation
The prompt of Zero_shot_COT is: Let's think step by step.
Generation result here
GPT-4+EP05 Generation
The prompt of EP05 is: Are you sure that’s your final answer? It might be worth taking another look.
Generation result here
GPT-4+EP09 Generation
The prompt of EP09 is: Stay focused and dedicated to your goals. Your consistent efforts will lead to outstanding achievements.
Generation result here
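The three ablation prompts quoted above differ only in the sentence added to the base prompt. The sketch below collects them in one place; appending them to a single base prompt is an assumption about how the ablation prompts are composed, made here only for illustration.

```python
# The three ablation-study-2 prompt additions, quoted from the descriptions above.
ABLATION_SUFFIXES = {
    "Zero_shot_COT": "Let's think step by step.",
    "EP05": "Are you sure that’s your final answer? It might be worth taking another look.",
    "EP09": "Stay focused and dedicated to your goals. Your consistent efforts will lead to outstanding achievements.",
}

def build_ablation_prompt(base_prompt: str, variant: str) -> str:
    # Appending the variant sentence to the base prompt is an assumption for illustration.
    return f"{base_prompt}\n\n{ABLATION_SUFFIXES[variant]}"
```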
Eval Results
The generated FSMs are evaluated by GPT-4 and Claude-2. Please find the corresponding results in the folders.
Multi-agent SAP rank result
The initial result of GPT-3.5+SAP is used here in the first loop.
The second-loop result of GPT-3.5+SAP+feedback is shown here and here.
Group of 6 rank result
The results of the ranking test by GPT-4 are here and by Claude-2 here.
In the result txt file, x_1 is GPT-4 with SAP, x_2 is GPT-4 without SAP, x_3 is GPT-3.5 with SAP, x_4 is GPT-3.5 without SAP, x_5 is Claude-2 with SAP, x_6 is Claude-2 without SAP.
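For readers scripting over the result txt files, the x_i labels above map directly to model settings as in the dictionary below; the parsing itself is left out because the file layout is not specified here.

```python
# Mapping of the x_i labels used in the group-of-6 result txt files to model settings.
GROUP_OF_6_LABELS = {
    "x_1": "GPT-4 with SAP",
    "x_2": "GPT-4 without SAP",
    "x_3": "GPT-3.5 with SAP",
    "x_4": "GPT-3.5 without SAP",
    "x_5": "Claude-2 with SAP",
    "x_6": "Claude-2 without SAP",
}
```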
Group of 4 rank result
GPT-4 & GPT-3.5
The results of the ranking test by GPT-4 are here and by Claude-2 here.
In the result txt file, x_1 is GPT-4 with SAP, x_2 is GPT-4 without SAP, x_3 is GPT-3.5 with SAP, and x_4 is GPT-3.5 without SAP.
GPT-4 & Claude-2
The results of the ranking test by GPT-4 are here and by Claude-2 here.
In the result txt file, x_1 is GPT-4 with SAP, x_2 is GPT-4 without SAP, x_3 is Claude-2 with SAP, and x_4 is Claude-2 without SAP.
Pairs rank result
Only the evaluation by GPT-4 is recorded.
Evaluate Claude-2
The result of the ranking test on Claude-2 is here. In the result txt file, x_3 is Claude-2 without SAP and x_4 is Claude-2 with SAP.
Evaluate GPT-3.5
The result of the ranking test on GPT-3.5 is here. In the result txt file, x_3 is GPT-3.5 without SAP and x_4 is GPT-3.5 with SAP.
Evaluate GPT-4
The result of the ranking test on GPT-4 is here. In the result txt file, x_3 is GPT-4 without SAP and x_4 is GPT-4 with SAP.
Ablation study 1 formatting result
The result of ablation study 1, evaluated by Claude-2, is here. In the result txt file, x_3 is GPT-4 without SAP and x_4 is GPT-4 with SAP.
Ablation study 2 result
Only evaluated by GPT-4.
GPT-4+SAP & GPT-4+Zero_shot_COT
The result of ablation study 2 is here.
In the result txt file, x_3 is GPT-4 with Zero_shot_COT and x_4 is GPT-4 with SAP.
GPT-4+SAP & GPT-4+EP05
The result of ablation study 2 is here.
In the result txt file, x_3 is GPT-4 with EP05 and x_4 is GPT-4 with SAP.
GPT-4+SAP & GPT-4+EP09
The result of ablation study 2 is here.
In the result txt file, x_3 is GPT-4 with EP09 and x_4 is GPT-4 with SAP.
Citation
```bibtex
@article{wang&zhong2024SAP_LLM,
      title={LLM-SAP: Large Language Model Situational Awareness Based Planning},
      author={Liman Wang and Hanyang Zhong},
      year={2024},
      eprint={2312.16127},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```