Survey: Tool Learning with Large Language Models

Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems.

This is a collection of papers related to tool learning with LLMs, organized according to our survey paper "Tool Learning with Large Language Models: A Survey".

Chinese: We have noticed that PaperAgent and 旺知识 have provided a brief and a comprehensive introduction in Chinese, respectively. We greatly appreciate their assistance.

:tada: Our survey paper has been accepted by Frontiers of Computer Science (FCS). The latest version of our paper has been released; please check it out!

Please feel free to contact us if you have any questions or suggestions!

Contribution

:tada::+1: Please feel free to open an issue or make a pull request! :tada::+1:

Citation

If you find our work helpful for your research, please kindly cite our paper:

@article{qu2024toolsurvey,
    author={Qu, Changle and Dai, Sunhao and Wei, Xiaochi and Cai, Hengyi and Wang, Shuaiqiang and Yin, Dawei and Xu, Jun and Wen, Ji-Rong},
    title={Tool Learning with Large Language Models: A Survey},
    journal={arXiv preprint arXiv:2405.17935},
    year={2024}
}

📋 Contents

🌟 Introduction

Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive survey of existing works on tool learning with LLMs. In this survey, we focus on reviewing the existing literature from two primary aspects: (1) why tool learning is beneficial and (2) how tool learning is implemented, enabling a comprehensive understanding of tool learning with LLMs. We first explore the “why” by reviewing both the benefits of tool integration and the inherent benefits of the tool learning paradigm from six specific aspects. In terms of the “how”, we systematically review the literature according to a taxonomy of four key stages in the tool learning workflow: task planning, tool selection, tool calling, and response generation. Additionally, we provide a detailed summary of existing benchmarks and evaluation methods, categorizing them according to their relevance to different stages. Finally, we discuss current challenges and outline potential future directions, aiming to inspire both researchers and industrial developers to further explore this emerging and promising area.

The overall workflow for tool learning with large language models.

<div align=center> <img src="assets/Framework.png" height="500"/> </div>
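To make the four stages concrete, the sketch below chains them into a single loop. It is a minimal illustration only: the `llm` callable, the toy tool catalog, and the `execute_tool` stub are hypothetical placeholders for exposition, not the API of any surveyed framework.

```python
# A minimal, illustrative sketch of the four-stage tool learning workflow:
# task planning -> tool selection -> tool calling -> response generation.
# All names below (llm, TOOLS, execute_tool) are hypothetical placeholders.
import json

TOOLS = {
    "weather_api": {
        "description": "Query current weather for a city.",
        "parameters": {"city": "string"},
    },
    "calculator": {
        "description": "Evaluate an arithmetic expression.",
        "parameters": {"expression": "string"},
    },
}

def execute_tool(name: str, args: dict) -> str:
    # Stub: a real system would dispatch to the actual API here.
    return f"<result of {name}({args})>"

def task_planning(llm, query: str) -> list[str]:
    """Stage 1: decompose the user query into sub-tasks."""
    plan = llm(f"Decompose into sub-tasks, one per line:\n{query}")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def tool_selection(llm, sub_task: str) -> str:
    """Stage 2: pick the most relevant tool for a sub-task."""
    catalog = json.dumps(TOOLS, indent=2)
    return llm(f"Given tools:\n{catalog}\nName the best tool for: {sub_task}").strip()

def tool_calling(llm, sub_task: str, tool_name: str) -> str:
    """Stage 3: fill in parameters matching the tool's schema, then execute."""
    schema = json.dumps(TOOLS[tool_name]["parameters"])
    args = json.loads(llm(f"Return JSON arguments matching {schema} for: {sub_task}"))
    return execute_tool(tool_name, args)

def response_generation(llm, query: str, observations: list[str]) -> str:
    """Stage 4: synthesize tool results into a final answer."""
    return llm(f"Question: {query}\nTool results: {observations}\nAnswer:")

def tool_learning_pipeline(llm, query: str) -> str:
    observations = []
    for sub_task in task_planning(llm, query):
        tool = tool_selection(llm, sub_task)
        observations.append(tool_calling(llm, sub_task, tool))
    return response_generation(llm, query, observations)
```

The papers surveyed below differ mainly in how each stage is realized, e.g., retrieval-based tool selection over thousands of APIs rather than prompting over a small inlined catalog.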

📄 Paper List

Why Tool Learning?

Benefit of Tools

Benefit of Tool Learning

How Tool Learning?

Task Planning

Tool Selection

Tool Calling

Response Generation

Benchmarks and Evaluation

Benchmarks

| Benchmark | Reference | Description | #Tools | #Instances | Link | Release Time |
| --- | --- | --- | --- | --- | --- | --- |
| API-Bank | [Paper] | Assessing the existing LLMs’ capabilities in planning, retrieving, and calling APIs. | 73 | 314 | [Repo] | 2023-04 |
| APIBench | [Paper] | A comprehensive benchmark constructed from TorchHub, TensorHub, and HuggingFace API Model Cards. | 1,645 | 16,450 | [Repo] | 2023-05 |
| ToolBench1 | [Paper] | A tool manipulation benchmark consisting of diverse software tools for real-world tasks. | 232 | 2,746 | [Repo] | 2023-05 |
| ToolAlpaca | [Paper] | Evaluating the ability of LLMs to utilize previously unseen tools without specific training. | 426 | 3,938 | [Repo] | 2023-06 |
| RestBench | [Paper] | A high-quality benchmark consisting of two real-world scenarios and human-annotated instructions with gold solution paths. | 94 | 157 | [Repo] | 2023-06 |
| ToolBench2 | [Paper] | An instruction-tuning dataset for tool use, constructed automatically using ChatGPT. | 16,464 | 126,486 | [Repo] | 2023-07 |
| MetaTool | [Paper] | A benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. | 199 | 21,127 | [Repo] | 2023-10 |
| TaskBench | [Paper] | A benchmark designed to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. | 103 | 28,271 | [Repo] | 2023-11 |
| T-Eval | [Paper] | Evaluating the tool-utilization capability step by step. | 15 | 533 | [Repo] | 2023-12 |
| ToolEyes | [Paper] | A fine-grained system tailored for the evaluation of LLMs’ tool learning capabilities in authentic scenarios. | 568 | 382 | [Repo] | 2024-01 |
| UltraTool | [Paper] | A novel benchmark designed to improve and evaluate LLMs’ ability in tool utilization within real-world scenarios. | 2,032 | 5,824 | [Repo] | 2024-01 |
| API-BLEND | [Paper] | A large corpus for training and systematic testing of tool-augmented LLMs. | - | 189,040 | [Repo] | 2024-02 |
| Seal-Tools | [Paper] | Seal-Tools contains hard instances that call multiple tools to complete the job, some of which are nested tool calls. | 4,076 | 14,076 | [Repo] | 2024-05 |
| ToolQA | [Paper] | Designed to faithfully evaluate LLMs’ ability to use external tools for question answering. (QA) | 13 | 1,530 | [Repo] | 2023-06 |
| ToolEmu | [Paper] | A framework that uses an LM to emulate tool execution, enabling scalable testing of LM agents against a diverse range of tools and scenarios. (Safety) | 311 | 144 | [Repo] | 2023-09 |
| ToolTalk | [Paper] | A benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. (Conversation) | 28 | 78 | [Repo] | 2023-11 |
| VIoT | [Paper] | A benchmark including a training dataset and established performance metrics for 11 representative vision models, categorized into three groups using semi-automated annotations. (VIoT) | 11 | 1,841 | [Repo] | 2023-12 |
| RoTBench | [Paper] | A multi-level benchmark for evaluating the robustness of LLMs in tool learning. (Robustness) | 568 | 105 | [Repo] | 2024-01 |
| MLLM-Tool | [Paper] | A system incorporating open-source LLMs and multi-modal encoders so that the learned LLMs can be aware of multi-modal input instructions and select the function-matched tool correctly. (Multi-modal) | 932 | 11,642 | [Repo] | 2024-01 |
| ToolSword | [Paper] | A comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. (Safety) | 100 | 440 | [Repo] | 2024-02 |
| SciToolBench | [Paper] | Spanning five scientific domains to evaluate LLMs’ abilities with tool assistance. (Sci-Reasoning) | 2,446 | 856 | [Repo] | 2024-02 |
| InjecAgent | [Paper] | A benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. (Safety) | 17 | 1,054 | [Repo] | 2024-02 |
| StableToolBench | [Paper] | A benchmark evolving from ToolBench, proposing a virtual API server and a stable evaluation system. (Stable) | 16,464 | 126,486 | [Repo] | 2024-03 |
| m&m's | [Paper] | A benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools, including multi-modal models, public APIs, and image processing modules. (Multi-modal) | 33 | 4,427 | [Repo] | 2024-03 |
| GeoLLM-QA | [Paper] | A novel benchmark of 1,000 diverse tasks, designed to capture complex RS workflows where LLMs handle complex data structures, nuanced reasoning, and interactions with dynamic user interfaces. (Remote Sensing) | 117 | 1,000 | [Repo] | 2024-04 |
| ToolLens | [Paper] | ToolLens includes concise yet intentionally multifaceted queries that better mimic real-world user interactions. (Tool Retrieval) | 464 | 18,770 | [Repo] | 2024-05 |
| SoAyBench | [Paper] | A solution-based LLM API-using methodology for academic information seeking. | 7 | 792 | [Repo], [HF] | 2024-05 |
| ToolBH | [Paper] | A benchmark that assesses LLM hallucinations through two perspectives: depth and breadth. | - | 700 | [Repo] | 2024-06 |
| ShortcutsBench | [Paper] | A large-scale real-world benchmark for API-based agents. | 1,414 | 7,627 | [Repo] | 2024-07 |
| GTA | [Paper] | A benchmark for general tool agents. | 14 | 229 | [Repo] | 2024-07 |
| WTU-Eval | [Paper] | A whether-or-not tool usage evaluation benchmark for large language models. | 4 | 916 | [Repo] | 2024-07 |
| AppWorld | [Paper] | A collection of complex everyday tasks requiring interactive coding with API calls. | 457 | 750 | [Repo] | 2024-07 |
| ToolSandbox | [Paper] | A stateful, conversational, and interactive tool-use benchmark. | 34 | 1,032 | [Repo] | 2024-08 |
| CToolEval | [Paper] | A benchmark designed to evaluate LLMs in the context of Chinese societal applications. | 27 | 398 | [Repo] | 2024-08 |
| NoisyToolBench | [Paper] | This benchmark includes a collection of provided APIs, ambiguous queries, anticipated questions for clarification, and the corresponding responses. | - | 200 | [Repo] | 2024-09 |

Evaluation

Challenges and Future Directions

Other Resources