awesome-llm-tool-learning
A curated list of awesome papers on LLM tool learning.
Preliminary
ReAct: Synergizing Reasoning and Acting in Language Models [ICLR 2023][Code]
RRHF: Rank Responses to Align Language Models with Human Feedback without tears [NeurIPS 2023][Code]
Extending Context Window of Large Language Models via Positional Interpolation [Arxiv 2023][Code]
Survey
Tool Learning with Foundation Models [Arxiv][Code]
Papers
2023
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [Arxiv][Code]
ART: Automatic multi-step reasoning and tool-use for large language models [Arxiv][Code]
Gorilla: Large Language Model Connected with Massive APIs [Arxiv][Code]
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs [Arxiv][Code]
Large Language Models as Tool Makers [Arxiv][Code]
MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting [ACL 2023][Code]
Gentopia.AI: A Collaborative Platform for Tool-Augmented LLMs [EMNLP 2023][Code]
CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [EMNLP Findings 2023][Code]
On the Tool Manipulation Capability of Open-source Large Language Models [Arxiv][Code]
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models [Arxiv]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [NeurIPS 2023][Code]
Toolformer: Language Models Can Teach Themselves to Use Tools [NeurIPS 2023][Code]
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [NeurIPS 2023][Code]
Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [NeurIPS 2023][Code]
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings [NeurIPS 2023][Code]
TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage [NeurIPS 2023 Workshop]
Making Language Models Better Tool Learners with Execution Feedback [Arxiv][Code]
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases [Arxiv][Code]
Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum [AAAI 2024][Code]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [ICLR 2024][Code]
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [ICLR 2024][Code]
ToolDec: Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding [Arxiv][Code]
Identifying the Risks of LM Agents with an LM-Emulated Sandbox [Arxiv][Code]
ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Arxiv]
Tool-Augmented Reward Modeling [Arxiv]
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [Arxiv][Code]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Arxiv][Code]
RestGPT: Connecting Large Language Models with Real-World RESTful APIs [Arxiv]
Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [Arxiv]
ControlLLM: Augment Language Models with Tools by Searching on Graphs [Arxiv][Code]
GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution [Arxiv][Code]
GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension [Arxiv]
AppAgent: Multimodal Agents as Smartphone Users [Arxiv][Code]
VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things [Arxiv]
Reverse Chain: A Generic-Rule for LLMs to Master Multi-API Planning [Arxiv][Code]
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [Arxiv]
FARS: FSM-Augmentation to Make LLMs Hallucinate the Right APIs [Arxiv]
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API [Arxiv]
Navigating Uncertainty: Optimizing API Dependency for Hallucination Reduction in Closed-Book Question Answering [Arxiv]
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [Arxiv][Code]
Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [Arxiv]
2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [Arxiv][Code]
Efficient Tool Use with Chain-of-Abstraction Reasoning [Arxiv]
AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls [Arxiv][Code]
ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval [Arxiv][Code]
ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph [Arxiv]
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs [Arxiv]
TOOLVERIFIER: Generalization to New Tools via Self-Verification [Arxiv][Code]
Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models [Arxiv]
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [Arxiv][Code]
Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments [Arxiv][Code]
Equipping Language Models with Tool Use Capability for Tabular Data Analysis in Finance [Arxiv][Code]
Benchmark
(APIBench) Gorilla: Large Language Model Connected with Massive APIs [Arxiv][Code]
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [EMNLP 2023][Code]
(ToolBench) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [Arxiv][Code]
ToolQA: A Dataset for LLM Question Answering with External Tools [NeurIPS 2023][Code]
MetaTool Benchmark: Deciding Whether to Use Tools and Which to Use [Arxiv][Code]
T-Eval: Evaluating the Tool Utilization Capability Step by Step [Arxiv][Code]
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [Arxiv][Code]
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios [Arxiv][Code]
A Comprehensive Evaluation of Tool-Assisted Generation Strategies [EMNLP Findings 2023]
ToolTalk: Evaluating Tool-Usage in a Conversational Setting [Arxiv][Code]
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks [Arxiv][Code]
RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning [Arxiv][Code]
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [Arxiv][Code]
ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages [Arxiv][Code]
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models [Arxiv][Code]
m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks [Arxiv][Code]