GTA: A Benchmark for General Tool Agents

<div align="center">

[ā¬‡ļø Dataset] [šŸ“ƒ Paper] [šŸŒ Project Page] [šŸ¤— Hugging Face]

</div>

🌟 Introduction

In developing general-purpose agents, significant focus has been placed on integrating large language models (LLMs) with various tools, which places high demands on the tool-use capabilities of LLMs. However, there are evident gaps between existing tool evaluations and real-world scenarios: current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only inputs, which fail to reveal agents' real-world problem-solving abilities.

GTA is a benchmark to evaluate the tool-use capability of LLM-based agents in real-world scenarios. It features three main aspects:

- **Real user queries**: human-written queries with simple real-world objectives but implicit tool use, requiring the agent to reason about the suitable tools and plan the solution steps.
- **Real deployed tools**: an evaluation platform equipped with executable tools across perception, operation, logic, and creativity categories, so the agent's actual task-solving performance can be measured.
- **Real multimodal inputs**: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed or handwritten materials, serve as the query contexts to keep the tasks close to real-world scenarios.

<div align="center"> <img src="figs/dataset.jpg" width="800"/> </div>

The comparison of GTA queries with AI-generated queries is shown in the table below. The steps and tool types for queries in ToolBench and m&m's are explicitly stated, as marked in red and blue. The queries in APIBench are simple, containing only one step. In contrast, GTA's queries are both step-implicit and tool-implicit.

<div align="center"> <img src="figs/implicit.jpg" width="800"/> </div>

📣 What's New

📚 Dataset Statistics

GTA comprises a total of 229 questions. Basic dataset statistics are presented below. The number of tools involved in each question varies from 1 to 4, and the number of steps to resolve a question ranges from 2 to 8.

<div align="center"> <img src="figs/statistics.jpg" width="800"/> </div>

Detailed information on the 14 tools is shown in the table below.

<div align="center"> <img src="figs/tools.jpg" width="800"/> </div>

šŸ† Leaderboard

We evaluate the language models in two modes:

Here is the performance of various LLMs on GTA. Inst, Tool, Arg, Summ, and Ans denote InstAcc, ToolAcc, ArgAcc SummAcc, and AnsAcc, respectively. P, O, L, C denote the F1 score of tool selection in Perception, Operation, Logic, and Creativity categories. Bold denotes the best score among all models. <ins>Underline</ins> denotes the best score under the same model scale. AnsAcc reflects the overall performance.

| Models | Inst | Tool | Arg | Summ | P | O | L | C | Ans |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **💛 API-based** | | | | | | | | | |
| gpt-4-1106-preview | 85.19 | 61.4 | <ins>37.88</ins> | <ins>75</ins> | 67.61 | 64.61 | 74.73 | 89.55 | <ins>46.59</ins> |
| gpt-4o | <ins>86.42</ins> | <ins>70.38</ins> | 35.19 | 72.77 | <ins>75.56</ins> | <ins>80</ins> | <ins>78.75</ins> | 82.35 | 41.52 |
| gpt-3.5-turbo | 67.63 | 42.91 | 20.83 | 60.24 | 58.99 | 62.5 | 59.85 | <ins>97.3</ins> | 23.62 |
| claude3-opus | 64.75 | 54.4 | 17.59 | 73.81 | 41.69 | 63.23 | 46.41 | 42.1 | 23.44 |
| mistral-large | 58.98 | 38.42 | 11.13 | 68.03 | 19.17 | 30.05 | 26.85 | 38.89 | 17.06 |
| **💚 Open-source** | | | | | | | | | |
| qwen1.5-72b-chat | <ins>48.83</ins> | 24.96 | <ins>7.9</ins> | 68.7 | 12.41 | 11.76 | 21.16 | 5.13 | <ins>13.32</ins> |
| qwen1.5-14b-chat | 42.25 | 18.85 | 6.28 | 60.06 | 19.93 | 23.4 | <ins>39.83</ins> | 25.45 | 12.42 |
| qwen1.5-7b-chat | 29.77 | 7.36 | 0.18 | 49.38 | 0 | 13.95 | 16.22 | 36 | 10.56 |
| mixtral-8x7b-instruct | 28.67 | 12.03 | 0.36 | 54.21 | 2.19 | <ins>34.69</ins> | 37.68 | 42.55 | 9.77 |
| deepseek-llm-67b-chat | 9.05 | 23.34 | 0.18 | 11.51 | 14.72 | 23.19 | 22.22 | 27.42 | 9.51 |
| llama3-70b-instruct | 47.6 | <ins>36.8</ins> | 4.31 | <ins>69.06</ins> | <ins>32.37</ins> | 22.37 | 36.48 | 31.86 | 8.32 |
| mistral-7b-instruct | 26.75 | 10.05 | 0 | 51.06 | 13.75 | 33.66 | 35.58 | 31.11 | 7.37 |
| deepseek-llm-7b-chat | 10.56 | 16.16 | 0.18 | 18.27 | 20.81 | 15.22 | 31.33 | 7.29 | 4 |
| yi-34b-chat | 23.23 | 10.77 | 0 | 34.99 | 11.6 | 11.76 | 12.97 | 5.13 | 3.21 |
| llama3-8b-instruct | 45.95 | 11.31 | 0 | 36.88 | 19.07 | 23.23 | 29.83 | <ins>42.86</ins> | 3.1 |
| yi-6b-chat | 21.26 | 14.72 | 0 | 32.54 | 1.47 | 0 | 1.18 | 0 | 0.58 |
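
As a rough illustration of what the P/O/L/C columns measure, the sketch below computes an F1 score between the tools an agent selected and the ground-truth tool chain of one query. It is only a conceptual example, not the official scorer (scoring is done by GTABenchEvaluator in OpenCompass); the function name and the example tool names are made up for illustration.

```python
from collections import Counter

def tool_selection_f1(predicted, reference):
    """Illustrative F1 between the multiset of predicted tools and the
    ground-truth tools of one query (not the official GTABenchEvaluator)."""
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())  # tool calls matched between both lists
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the agent picked OCR twice and Calculator once,
# while the reference chain uses OCR and Calculator once each.
print(tool_selection_f1(["OCR", "OCR", "Calculator"], ["OCR", "Calculator"]))  # 0.8
```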

🚀 Evaluate on GTA

Prepare GTA Dataset

  1. Clone this repo.
git clone https://github.com/open-compass/GTA.git
cd GTA
  2. Download the dataset from the release file.
mkdir ./opencompass/data

Put the dataset under the folder ./opencompass/data/. The file structure should be:

GTA/
├── agentlego
├── opencompass
│   ├── data
│   │   ├── gta_dataset
│   ├── ...
├── ...
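
As a quick sanity check that the dataset unpacked correctly, the sketch below lists the files in the dataset folder and loads toolmeta.json (the file later referenced by the tool_meta config option). The folder path follows the layout above; the other file names printed are simply whatever is on disk.

```python
import json
from pathlib import Path

data_dir = Path("./opencompass/data/gta_dataset")

# List whatever files were unpacked from the release archive.
for p in sorted(data_dir.iterdir()):
    print(p.name)

# toolmeta.json is referenced by the `tool_meta` option in eval_gta_bench.py;
# print its top-level structure to confirm the download is intact.
with open(data_dir / "toolmeta.json") as f:
    toolmeta = json.load(f)
print(type(toolmeta), len(toolmeta))
```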

Prepare Your Model

  1. Download the model weights.
pip install -U huggingface_hub
# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False
huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False
  2. Install LMDeploy.
conda create -n lmdeploy python=3.10
conda activate lmdeploy

For CUDA 12:

pip install lmdeploy

For CUDA 11+:

export LMDEPLOY_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
  3. Launch a model service.
# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
lmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat
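
Before wiring the service into OpenCompass, it can be worth confirming that the OpenAI-compatible endpoint answers. The sketch below assumes LMDeploy's /v1/chat/completions route and the port and model name used in the command above; adjust the host if the service runs on another machine.

```python
import requests

# Point this at the host/port passed to `lmdeploy serve api_server`.
url = "http://127.0.0.1:12580/v1/chat/completions"

payload = {
    "model": "qwen1.5-7b-chat",  # the --model-name used when launching the server
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "max_tokens": 16,
}
resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```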

Deploy Tools

  1. Install AgentLego.
conda create -n agentlego python=3.11.9
conda activate agentlego
cd agentlego
pip install -r requirements_all.txt
pip install agentlego
pip install -e .
mim install mmcv==2.1.0

Open ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py and change `_supports_sdpa = False` to `_supports_sdpa = True` at line 1279.
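
If you prefer not to edit the file by hand, the small helper below makes the same one-line change programmatically. The path is the one from the instruction above; adjust it to your conda installation.

```python
from pathlib import Path

# Path from the instruction above; adjust to your conda installation.
target = (Path.home() /
          "anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py")

src = target.read_text()
# Apply the one-line change described above (no-op if already applied).
patched = src.replace("_supports_sdpa = False", "_supports_sdpa = True", 1)
target.write_text(patched)
print("patched" if patched != src else "no change needed")
```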

  2. Deploy tools for the GTA benchmark.

To use the GoogleSearch and MathOCR tools, you should first get a Serper API key from https://serper.dev and a Mathpix app ID and app key from https://mathpix.com/. Then export them as environment variables.

export SERPER_API_KEY='your_serper_key_for_google_search_tool'
export MATHPIX_APP_ID='your_mathpix_app_id_for_mathocr_tool'
export MATHPIX_APP_KEY='your_mathpix_app_key_for_mathocr_tool'

Start the tool server.

agentlego-server start --port 16181 --extra ./benchmark.py  `cat benchmark_toollist.txt` --host 0.0.0.0
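
A quick reachability check can save a failed evaluation run later. The sketch below only assumes that the tool server answers HTTP on the chosen port; it does not rely on any particular AgentLego route, so even a 404 response indicates the service is listening.

```python
import requests

# Host/port from the `agentlego-server start` command above.
tool_server = "http://127.0.0.1:16181"

try:
    # Any HTTP status code means the server process is listening.
    resp = requests.get(tool_server, timeout=10)
    print(f"tool server reachable, status {resp.status_code}")
except requests.ConnectionError:
    print("tool server not reachable; check the agentlego-server process and port")
```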

Start Evaluation

  1. Install OpenCompass.
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
cd agentlego
pip install -e .
cd ../opencompass
pip install -e .
  2. Modify the config file configs/eval_gta_bench.py as shown below.

The IP address and port in openai_api_base are those of your model service, i.e., the port you specified when launching the model with LMDeploy.

The IP address and port in tool_server are those of your tool service, i.e., the port you specified when starting agentlego-server.

models = [
  dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        tool_meta='data/gta_dataset/toolmeta.json',
        batch_size=8,
    ),
]

If you infer and evaluate in step-by-step mode, you should comment out tool_server and enable tool_meta in configs/eval_gta_bench.py, and set infer mode and eval mode to every_with_gt in configs/datasets/gta_bench.py:

models = [
  dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        # tool_server='http://10.140.0.138:16181',
        tool_meta='data/gta_dataset/toolmeta.json',
        batch_size=8,
    ),
]
gta_bench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template="""{questions}""",
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)
gta_bench_eval_cfg = dict(evaluator=dict(type=GTABenchEvaluator, mode='every_with_gt'))

If you infer and evaluate in end-to-end mode, you should comment out tool_meta and enable tool_server in configs/eval_gta_bench.py, and set infer mode and eval mode to every in configs/datasets/gta_bench.py:

models = [
  dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        # tool_meta='data/gta_dataset/toolmeta.json',
        batch_size=8,
    ),
]
gta_bench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template="""{questions}""",
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=AgentInferencer, infer_mode='every'),
)
gta_bench_eval_cfg = dict(evaluator=dict(type=GTABenchEvaluator, mode='every'))
  3. Infer and evaluate with OpenCompass.
# infer only
python run.py configs/eval_gta_bench.py --max-num-workers 32 --debug --mode infer
# evaluate only
# srun -p llmit -q auto python run.py configs/eval_gta_bench.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval
srun -p llmit -q auto python run.py configs/eval_gta_bench.py --max-num-workers 32 --debug --reuse 20240628_115514 --mode eval
# infer and evaluate
python run.py configs/eval_gta_bench.py -p llmit -q auto --max-num-workers 32 --debug
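
OpenCompass writes each run into a timestamped working directory, and the --reuse flag above takes such a timestamp (e.g. 20240628_115514). If you are unsure which timestamp to reuse, a listing like the sketch below can help; the outputs/ location is OpenCompass's default and may differ if you configured a custom work directory.

```python
from pathlib import Path

# List timestamped run directories under OpenCompass's default outputs/ folder.
# The exact layout can vary by OpenCompass version and configuration.
for run_dir in sorted(Path("outputs").glob("*/*")):
    if run_dir.is_dir():
        print(run_dir.name)
```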

šŸ“ Citation

If you use GTA in your research, please cite the following paper:

@misc{wang2024gtabenchmarkgeneraltool,
      title={GTA: A Benchmark for General Tool Agents}, 
      author={Jize Wang and Zerun Ma and Yining Li and Songyang Zhang and Cailian Chen and Kai Chen and Xinyi Le},
      year={2024},
      eprint={2407.08713},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.08713}, 
}