
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

✨Introduction

The recent trend of using Large Language Models (LLMs) as tool agents in real-world applications underscores the necessity of comprehensively evaluating their capabilities, particularly in complex scenarios that involve planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, and therefore offer only a limited perspective on tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' tool-utilization abilities in real-world scenarios. UltraTool covers the entire process of using tools, from planning and creating them to applying them in complex tasks. It emphasizes real-world complexity, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of natural-language planning, which precedes tool usage and simplifies task solving by mapping out the intermediate steps. Unlike previous work, it thus removes the restriction of a pre-defined toolset during planning. Through extensive experiments on various LLMs, we offer novel insights into evaluating LLMs' tool-utilization capabilities, contributing a fresh perspective to this rapidly evolving field.

(Figure: data construction pipeline of UltraTool)

🚀What's New

📂Folders

The repository is structured as follows:

├── data/ # data
│   ├── Chinese-dataset/ # data in Chinese
│   │   ├── example/ # few-shot examples used during inference
│   │   │   ├── planning.json
│   │   │   └── ...
│   │   ├── test_set/ # test set
│   │   │   ├── planning.json
│   │   │   └── ...
│   │   ├── test.json # test data used for constructing the test set
│   │   └── dev.json # optional development set
│   ├── English-dataset/ # data in English
│   │   ├── example/
│   │   │   ├── planning.json
│   │   │   └── ...
│   │   ├── test_set/
│   │   │   ├── planning.json
│   │   │   └── ...
│   │   ├── test.json
│   │   └── dev.json
├── evaluation/ # evaluation scripts
├── inference/ # inference scripts
├── predictions/ # prediction results
│   ├── Chinese-dataset/ # prediction results on Chinese-dataset
│   │   ├── gpt-3.5/ # prediction results of gpt-3.5
│   │   │   ├── eval/
│   │   │   │   ├── inferback/ # temporary results
│   │   │   │   ├── planning_eval.json # GPT-4 evaluation results for planning
│   │   │   │   ├── tool_creation_eval.json # GPT-4 evaluation results for tool creation
│   │   │   │   └── tool_creation_post_process.json # post-processed results for tool creation
│   │   │   ├── inferback/ # temporary results
│   │   │   ├── planning.json # prediction results for planning
│   │   │   └── ...
│   │   ├── ...
│   ├── English-dataset/ # prediction results on English-dataset
│   │   ├── gpt-3.5/
│   │   │   ├── eval/
│   │   │   │   ├── inferback/
│   │   │   │   ├── planning_eval.json
│   │   │   │   ├── tool_creation_eval.json
│   │   │   │   └── tool_creation_post_process.json
│   │   │   ├── inferback/
│   │   │   ├── planning.json
│   │   │   └── ...
│   │   ├── ...

πŸ› οΈQuick start

Preparations

$ git clone https://github.com/JoeYing1019/UltraTool.git
$ cd UltraTool
$ pip install -r requirements.txt

Inference

Closed-source LLMs

We offer the inference code for both GPT-3.5 and GPT-4.

# run gpt-3.5 inference on the Chinese-dataset for all six tasks
python inference/inference_openai.py --model gpt-3.5 --language ch --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

# run gpt-3.5 inference on the English-dataset for all six tasks
python inference/inference_openai.py --model gpt-3.5 --language en --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

# run gpt-4 inference on the Chinese-dataset for all six tasks
python inference/inference_openai.py --model gpt-4 --language ch --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

# run gpt-4 inference on the English-dataset for all six tasks
python inference/inference_openai.py --model gpt-4 --language en --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

The inference results are saved under predictions/; we have already provided the GPT-3.5 and GPT-4 results there.
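For orientation, the core of inference_openai.py is one chat-completion call per test instance. The sketch below shows that general pattern with the openai>=1.0 Python SDK; the model name, prompt, and message format are illustrative assumptions, not the script's exact implementation.

from openai import OpenAI  # assumes the openai>=1.0 Python SDK; reads OPENAI_API_KEY from the environment

client = OpenAI()

def run_inference(instruction: str, model: str = "gpt-3.5-turbo") -> str:
    # One chat-completion call per test instance; the real script also prepends
    # few-shot examples taken from data/*/example/.
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
    )
    return reply.choices[0].message.content

print(run_inference("Plan the steps needed to book a flight and a hotel for a weekend trip."))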

Open-source LLMs

We use ChatGLM3 as an example to walk through the whole process. To evaluate ChatGLM3 on the UltraTool benchmark, follow these steps:

1. Download the model and set the model path

Download ChatGLM3 from Hugging Face, then open inference/inference_ultraltool.py and set args.model_path to <path_to_your_local_chatglm_model>, replacing the placeholder with the directory that contains your local ChatGLM3 checkpoint (a loading sketch follows below).
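As a sanity check for the path, here is a minimal sketch of loading a local ChatGLM3 checkpoint with transformers, following the usage documented on the THUDM/chatglm3-6b model card; the loading code inside the inference script may differ, and the path below is a placeholder.

from transformers import AutoModel, AutoTokenizer

# Placeholder: point this at your local ChatGLM3 checkpoint (e.g. a clone of THUDM/chatglm3-6b).
model_path = "<path_to_your_local_chatglm_model>"

# ChatGLM3 ships its modeling and chat code with the checkpoint, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
model = model.eval()

# Quick smoke test with the chat() helper provided by the ChatGLM3 remote code.
response, _history = model.chat(tokenizer, "Hello", history=[])
print(response)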

2. Update the run.sh Script

At line 9 in run.sh, modify the model_types array:

model_types=(chatglm)

3. Run Inference

Run the run.sh script:

bash scripts/run.sh

4. Accessing the Results

After running the script, the inference results of ChatGLM3 on all UltraTool tasks, for both the Chinese-dataset and the English-dataset, can be found under:

predictions/Chinese-dataset/chatglm
predictions/English-dataset/chatglm

Evaluation

Planning

The evaluation of planning relies on GPT-4, with the evaluation pipeline structured as follows:

# evaluate gpt-3.5 and gpt-4 on Chinese-dataset
python evaluation/inference_planning_eval.py --models ['gpt-3.5', 'gpt-4'] --language ch
python evaluation/planning.py --models ['gpt-3.5', 'gpt-4'] --language ch

# evaluate gpt-3.5 and gpt-4 on English-dataset
python evaluation/inference_planning_eval.py --models ['gpt-3.5', 'gpt-4'] --language en
python evaluation/planning.py --models ['gpt-3.5', 'gpt-4'] --language en
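Under the hood this is an LLM-as-judge setup: GPT-4 is prompted to score each predicted plan. The sketch below illustrates only the general pattern; the actual prompt, scoring dimensions, and file handling are defined in the evaluation scripts, and the field names and scale here are assumptions.

import json
from openai import OpenAI  # assumes the openai>=1.0 Python SDK; reads OPENAI_API_KEY from the environment

client = OpenAI()

def judge_plan(query: str, reference_plan: str, predicted_plan: str) -> float:
    # Illustrative judging prompt; the repository's actual prompt and criteria differ.
    prompt = (
        "You are grading a step-by-step plan for the user query below.\n"
        f"Query: {query}\n"
        f"Reference plan: {reference_plan}\n"
        f"Predicted plan: {predicted_plan}\n"
        'Rate the predicted plan from 1 to 10 and answer with JSON only: {"score": <number>}.'
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(json.loads(reply.choices[0].message.content)["score"])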

Tool creation

Like planning, tool creation is evaluated with GPT-4, but it first requires an additional post-processing step. The pipeline is as follows:

# evaluate gpt-3.5 and gpt-4 on Chinese-dataset
python evaluation/post_process_tool_creation.py --models ['gpt-3.5', 'gpt-4'] --language ch
python evaluation/inference_tool_creation_eval.py --models ['gpt-3.5', 'gpt-4'] --language ch
python evaluation/tool_creation.py --models ['gpt-3.5', 'gpt-4'] --language ch

# evaluate gpt-3.5 and gpt-4 on English-dataset
python evaluation/post_process_tool_creation.py --models ['gpt-3.5', 'gpt-4'] --language en
python evaluation/inference_tool_creation_eval.py --models ['gpt-3.5', 'gpt-4'] --language en
python evaluation/tool_creation.py --models ['gpt-3.5', 'gpt-4'] --language en
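The extra post-processing step exists because the raw model output has to be turned into a parseable tool definition before GPT-4 can judge it. The helper below is an illustrative sketch of that kind of extraction, not the actual logic of post_process_tool_creation.py.

import json
import re

def extract_tool_json(raw_output: str):
    # Heuristic: grab the outermost {...} span and try to parse it as JSON.
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        return None  # no JSON-like content; treated as invalid downstream
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Example with a hypothetical tool definition:
# extract_tool_json('Here is the tool: {"name": "currency_converter", "description": "Convert an amount between currencies"}')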

Tool creation awareness

# evaluate gpt-3.5 and gpt-4 on Chinese-dataset
python evaluation/tool_creation_awareness.py --models ['gpt-3.5', 'gpt-4'] --language ch

# evaluate gpt-3.5 and gpt-4 on English-dataset
python evaluation/tool_creation_awareness.py --models ['gpt-3.5', 'gpt-4'] --language en

Tool usage awareness

# evaluate gpt-3.5 and gpt-4 on Chinese-dataset
python evaluation/tool_usage_awareness.py --models ['gpt-3.5', 'gpt-4'] --language ch

# evaluate gpt-3.5 and gpt-4 on English-dataset
python evaluation/tool_usage_awareness.py --models ['gpt-3.5', 'gpt-4'] --language en

Tool selection

# evaluate gpt-3.5 and gpt-4 on Chinese-dataset
python evaluation/tool_selection.py --models ['gpt-3.5', 'gpt-4'] --language ch

# evaluate gpt-3.5 and gpt-4 on English-dataset
python evaluation/tool_selection.py --models ['gpt-3.5', 'gpt-4'] --language en

Tool usage

# evaluate gpt-3.5 and gpt-4 on Chinese-dataset
python evaluation/tool_usage.py --models ['gpt-3.5', 'gpt-4'] --language ch

# evaluate gpt-3.5 and gpt-4 on English-dataset
python evaluation/tool_usage.py --models ['gpt-3.5', 'gpt-4'] --language en
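Unlike planning and tool creation, the awareness, selection, and usage tasks are scored directly by the local scripts against the reference annotations. As a rough illustration only, the sketch below shows a reference-based accuracy check; the real metrics, file format, and field names are defined in the evaluation scripts, and the names used here ("prediction", "answer") are placeholders.

import json

def accuracy(prediction_file: str, pred_field: str = "prediction", gold_field: str = "answer") -> float:
    # Assumes the file holds a JSON list of records; adjust if the repository uses another layout.
    with open(prediction_file, encoding="utf-8") as f:
        records = json.load(f)
    correct = sum(str(r[pred_field]).strip() == str(r[gold_field]).strip() for r in records)
    return correct / max(len(records), 1)

# Example (hypothetical path):
# print(accuracy("predictions/Chinese-dataset/gpt-3.5/tool_selection.json"))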

📈Benchmark Results

(Figure: benchmark results of the evaluated LLMs on UltraTool)

✏️ Citation

If you find this project useful in your research, please cite:

@misc{huang2024planning,
      title={Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios}, 
      author={Shijue Huang and Wanjun Zhong and Jianqiao Lu and Qi Zhu and Jiahui Gao and Weiwen Liu and Yutai Hou and Xingshan Zeng and Yasheng Wang and Lifeng Shang and Xin Jiang and Ruifeng Xu and Qun Liu},
      year={2024},
      eprint={2401.17167},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}