awesome-llm-tool-learning

A curated list of papers on tool learning with large language models (LLMs).

Preliminary

ReAct: Synergizing Reasoning and Acting in Language Models [ICLR 2023][Code]
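ReAct interleaves reasoning traces with tool actions and their observations. A minimal sketch of that Thought → Action → Observation loop, using a toy calculator tool and a scripted stand-in for the model (all names here are illustrative, not from the paper's released code):

```python
# Minimal ReAct-style loop: the "model" is scripted; each step emits a
# thought and an action, and tool observations feed back into the loop.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

# Scripted stand-in for the LLM: each step is (thought, action, argument).
SCRIPT = [
    ("I need to compute 12 * 7 before answering.", "calculator", "12 * 7"),
    ("I have the result; finish.", "finish", None),
]

def react_loop() -> str:
    observation = None
    for thought, action, arg in SCRIPT:
        print(f"Thought: {thought}")
        if action == "finish":
            return observation  # final answer is the last observation
        observation = TOOLS[action](arg)
        print(f"Action: {action}[{arg}]")
        print(f"Observation: {observation}")

print(f"Answer: {react_loop()}")
```

In an actual ReAct agent, the scripted steps are replaced by LLM generations conditioned on the accumulating thought/action/observation history.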

RRHF: Rank Responses to Align Language Models with Human Feedback without tears [NeurIPS 2023][Code]

Extending Context Window of Large Language Models via Positional Interpolation [Arxiv 2023][Code]

Survey

Tool Learning with Foundation Models [Arxiv][Code]

Papers

2023

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [Arxiv][Code]

ART: Automatic multi-step reasoning and tool-use for large language models [Arxiv][Code]

Gorilla: Large Language Model Connected with Massive APIs [Arxiv][Code]

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs [Arxiv][Code]

Large Language Models as Tool Makers [Arxiv][Code]

MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting [ACL 2023][Code]

Gentopia.AI: A Collaborative Platform for Tool-Augmented LLMs [EMNLP 2023][Code]

CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [EMNLP Findings 2023][Code]

On the Tool Manipulation Capability of Open-source Large Language Models [Arxiv][Code]

Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models [Arxiv]

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [NeurIPS 2023][Code]

Toolformer: Language Models Can Teach Themselves to Use Tools [NeurIPS 2023][Code]

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [NeurIPS 2023][Code]

Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [NeurIPS 2023][Code]

ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings [NeurIPS 2023][Code]

TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage [NeurIPS 2023 Workshop]

Making Language Models Better Tool Learners with Execution Feedback [Arxiv][Code]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases [Arxiv][Code]

Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum [AAAI 2024][Code]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [ICLR 2024][Code]

CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [ICLR 2024][Code]

ToolDec: Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding [Arxiv][Code]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox [Arxiv][Code]

ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Arxiv]

Tool-Augmented Reward Modeling [Arxiv]

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [Arxiv][Code]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Arxiv][Code]

RestGPT: Connecting Large Language Models with Real-World RESTful APIs [Arxiv]

Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [Arxiv]

ControlLLM: Augment Language Models with Tools by Searching on Graphs [Arxiv][Code]

GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution [Arxiv][Code]

GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension [Arxiv]

AppAgent: Multimodal Agents as Smartphone Users [Arxiv][Code]

VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things [Arxiv]

Reverse Chain: A Generic-Rule for LLMs to Master Multi-API Planning [Arxiv][Code]

CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [Arxiv]

FARS: FSM-Augmentation to Make LLMs Hallucinate the Right APIs [Arxiv]

Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API [Arxiv]

Navigating Uncertainty: Optimizing API Dependency for Hallucination Reduction in Closed-Book Question Answering [Arxiv]

EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [Arxiv][Code]

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [Arxiv]

2024

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [Arxiv][Code]

Efficient Tool Use with Chain-of-Abstraction Reasoning [Arxiv]

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls [Arxiv][Code]

ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval [Arxiv][Code]

ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph [Arxiv]

API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs [Arxiv]

TOOLVERIFIER: Generalization to New Tools via Self-Verification [Arxiv][Code]

Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models [Arxiv]

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [Arxiv][Code]

Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments [Arxiv][Code]

Equipping Language Models with Tool Use Capability for Tabular Data Analysis in Finance [Arxiv][Code]

Benchmark

(APIBench) Gorilla: Large Language Model Connected with Massive APIs [Arxiv][Code]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [EMNLP 2023][Code]

(ToolBench) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [Arxiv][Code]

ToolQA: A Dataset for LLM Question Answering with External Tools [NeurIPS 2023][Code]

MetaTool Benchmark: Deciding Whether to Use Tools and Which to Use [Arxiv][Code]

T-Eval: Evaluating the Tool Utilization Capability Step by Step [Arxiv][Code]

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [Arxiv][Code]

ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios [Arxiv][Code]

A Comprehensive Evaluation of Tool-Assisted Generation Strategies [EMNLP Findings 2023]

ToolTalk: Evaluating Tool-Usage in a Conversational Setting [Arxiv][Code]

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks [Arxiv][Code]

RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning [Arxiv][Code]

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [Arxiv][Code]

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages [Arxiv][Code]

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models [Arxiv][Code]

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks [Arxiv][Code]