

<div align= "center"> <h1> 🛠️ToolBench🤖</h1> </div> <div align="center">

</div> <p align="center"> <a href="#model">Model</a> • <a href="#data">Data Release</a> • <a href="#web-ui">Web Demo</a> • <a href="#tool-eval">Tool Eval</a> • <a href="https://arxiv.org/pdf/2307.16789.pdf">Paper</a> • <a href="#citation">Citation</a> </p> </div> <div align="center"> <img src="https://cdn.discordapp.com/attachments/941582479117127680/1111543600879259749/20230526075532.png" width="350px"> </div>

🔨This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset. It is constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which is upgraded with enhanced function call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model ToolLLaMA fine-tuned on ToolBench.

2024.8 Update: We have updated the RapidAPI server with a new IP, so please make sure you get the latest code. You can also build the server locally using the code here.

What's New

✨Here is an overview of the dataset construction, training, and evaluation.

<br> <div align="center"> <img src="assets/overview.png" width="800px"> </div> <br>


<br> <div align="center"> <img src="assets/comparison.png" width="800px"> </div> <br>

We also provide a demo of using ToolLLaMA.

<div align="center">



Currently, our ToolLLaMA has reached the performance of ChatGPT (turbo-16k) in tool use. In the future, we will continually improve the data quality and increase the coverage of real-world tools.

<div align="center"> <img src="assets/performance.png" width="300px"> </div>

πŸ‘ToolBench is intended solely for research and educational purposes and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. It is distributed under Apache License 2.0. Below is the statistics of the data :

Tool Nums | API Nums | Instance Nums | Real API Call | Reasoning Traces

We crawl 16000+ real-world APIs from RapidAPI, and curate realistic human instructions that involve them. Below we present a hierarchy of RapidAPI and our instruction generation process.

<br> <div align="center"> <img src="assets/instructiongeneration.png" width="800px"> </div> <br>

ToolBench contains both single-tool and multi-tool scenarios. The multi-tool scenarios can be further categorized into intra-category multi-tool and intra-collection multi-tool. We use the DFSDT method to create data for all scenarios. Here is an illustration of the data creation process using the DFSDT method:

<div align="center"> <img src="assets/answer_anno.png" width="800px"> </div>
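
To make the idea of a depth-first search over a decision tree more concrete, here is a minimal, generic sketch of depth-first search with backtracking. It is not the ToolBench implementation of DFSDT: the `expand` and `is_solution` callbacks, the depth limit, and the toy number puzzle are all assumptions chosen purely for illustration.

```python
from typing import Callable, List, Optional, TypeVar

State = TypeVar("State")


def dfs_search(
    state: State,
    expand: Callable[[State], List[State]],
    is_solution: Callable[[State], bool],
    max_depth: int,
) -> Optional[List[State]]:
    """Return the first path from `state` to a solution, or None if none is found."""
    if is_solution(state):
        return [state]
    if max_depth == 0:
        return None  # dead end: the caller backtracks to a sibling branch
    for child in expand(state):
        path = dfs_search(child, expand, is_solution, max_depth - 1)
        if path is not None:
            return [state] + path
    return None  # every child failed, so backtrack further up


# Toy puzzle (hypothetical): reach 10 from 1 using "*2" or "+1" steps.
path = dfs_search(
    1,
    expand=lambda n: [n * 2, n + 1],
    is_solution=lambda n: n == 10,
    max_depth=4,
)
print(path)
```

In a tool-use setting, a state would presumably hold the conversation so far and `expand` would propose candidate next actions, but that mapping is an interpretation rather than something this sketch implements.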

Data Release

Please download our dataset using the following link: Google Drive or Tsinghua Cloud. Note that data_0801 is the old version of the data. The file structure is as follows:

├── /data/
│  ├── /instruction/
│  ├── /answer/
│  ├── /toolenv/
│  ├── /retrieval/
│  ├── /test_instruction/
│  ├── /test_query_ids/
│  ├── /retrieval_test_query_ids/
│  ├── toolllama_G123_dfs_train.json
│  └── toolllama_G123_dfs_eval.json
├── /reproduction_data/
│  ├── /chatgpt_cot/
│  ├── /chatgpt_dfs/
│  ├── ...
│  └── /toolllama_dfs/

Here are some descriptions for the data directory:

Please make sure you have downloaded the necessary data and put the directory (e.g. data/) under ToolBench/, so that the following bash scripts can navigate to the related data.
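
Before running the scripts, it can help to confirm that the data landed where they expect it. The following is a small sketch that checks for the entries listed in the file structure above. The helper name `missing_data_entries` is hypothetical, and the default `data` path assumes the script is run from the ToolBench/ folder.

```python
from pathlib import Path
from typing import List

# Entries taken from the /data/ file structure listed above.
EXPECTED_ENTRIES = [
    "instruction",
    "answer",
    "toolenv",
    "retrieval",
    "test_instruction",
    "test_query_ids",
    "retrieval_test_query_ids",
    "toolllama_G123_dfs_train.json",
    "toolllama_G123_dfs_eval.json",
]


def missing_data_entries(data_dir: str = "data") -> List[str]:
    """Return the expected entries that are absent from `data_dir`."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_data_entries()
    print("missing entries:", missing if missing else "none")
```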


We release the ToolLLaMA-2-7b-v2 which is trained on the latest version data, and ToolLLaMA-7b-v1, ToolLLaMA-7b-LoRA-v1 which are trained on the 0801 version data. All models are trained on the released dataset in a multi-task fashion. We also release the tool retriever trained under our experimental setting.



Clone this repository and navigate to the ToolBench folder.

git clone git@github.com:OpenBMB/ToolBench.git
cd ToolBench

Install Package (python>=3.9)

pip install -r requirements.txt

or for ToolEval only

pip install -r toolbench/tooleval/requirements.txt

Prepare the data and tool environment:

wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1XFjDxVZdUY7TXYF2yvzx3pJlS2fy78jk&confirm=yes' -O data.zip
unzip data.zip


Training Retriever

export PYTHONPATH=./
python preprocess/preprocess_retriever_data.py \
    --query_file data/instruction/G1_query.json \
    --index_file data/test_query_ids/G1_instruction_test_query_ids.json \
    --dataset_name G1 \
    --output_dir data/retrieval/G1
export PYTHONPATH=./
python toolbench/retrieval/train.py \
    --data_path data/retrieval/G1/ \
    --model_name bert-base-uncased \
    --output_path retrieval_model \
    --num_epochs 5 \
    --train_batch_size 32 \
    --learning_rate 2e-5 \
    --warmup_steps 500 \
    --max_seq_length 256

Training ToolLLaMA

export PYTHONPATH=./
python preprocess/preprocess_toolllama_data.py \
    --tool_data_dir data/answer/G1_answer \
    --method DFS_woFilter_w2 \
    --output_file data/answer/toolllama_G1_dfs.json
export PYTHONPATH=./
torchrun --nproc_per_node=2 --master_port=20001 toolbench/train/train_mem.py \
    --model_name_or_path huggyllama/llama-7b  \
    --data_path  data/toolllama_G123_dfs_train.json \
    --eval_data_path  data/toolllama_G123_dfs_eval.json \
    --conv_template tool-llama-single-round \
    --bf16 True \
    --output_dir toolllama \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "epoch" \
    --prediction_loss_only \
    --save_strategy "epoch" \
    --save_total_limit 8 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --source_model_max_length 2048 \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to none
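
As a quick worked example, the effective global batch size implied by the command above can be computed from its flags. This assumes the usual convention that it equals the number of processes times the per-device batch size times the gradient accumulation steps, with one GPU per process. The helper name is hypothetical.

```python
def effective_batch_size(num_processes: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Number of training examples that contribute to one optimizer step."""
    return num_processes * per_device_batch * grad_accum_steps


# Values from the torchrun command above:
# --nproc_per_node=2, --per_device_train_batch_size 2, --gradient_accumulation_steps 8
full_finetune_batch = effective_batch_size(2, 2, 8)
print(full_finetune_batch)  # 32
```

If you change the number of GPUs, adjusting `--gradient_accumulation_steps` so that this product stays the same is one way to keep the optimization settings comparable.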

To train the LoRA version:

export PYTHONPATH=./
deepspeed --master_port=20001 toolbench/train/train_lora.py \
    --model_name_or_path huggyllama/llama-7b  \
    --data_path  data/toolllama_G123_dfs_train.json \
    --eval_data_path  data/toolllama_G123_dfs_eval.json \
    --conv_template tool-llama-single-round \
    --bf16 True \
    --output_dir toolllama_lora \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "epoch" \
    --prediction_loss_only \
    --save_strategy "epoch" \
    --save_total_limit 8 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --source_model_max_length 2048 \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed ds_configs/stage2.json \
    --report_to none

Inference With Our RapidAPI Server

Please fill out the form first. After reviewing it, we will send you the ToolBench key. Then prepare your ToolBench key by:

export TOOLBENCH_KEY="your_toolbench_key"
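
If you drive the pipeline from your own Python wrapper, a small guard can catch a missing key before a long run starts. This is a sketch only: the helper `require_env` is hypothetical and is not part of ToolBench.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running inference")
    return value
```

For example, `toolbench_key = require_env("TOOLBENCH_KEY")` fails immediately with a clear message instead of partway through inference.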

For ToolLLaMA

To run inference with ToolLLaMA, run the following commands:

export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model toolllama \
    --model_path ToolBench/ToolLLaMA-7b \
    --max_observation_length 1024 \
    --observ_compress_method truncate \
    --method DFS_woFilter_w2 \
    --input_query_file data/test_instruction/G1_instruction.json \
    --output_answer_file toolllama_dfs_inference_result \
    --toolbench_key $TOOLBENCH_KEY

For ToolLLaMA-LoRA:

export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model toolllama \
    --model_path huggyllama/llama-7b \
    --lora \
    --lora_path /path/to/your/downloaded/ToolLLaMA-7b-LoRA \
    --max_observation_length 1024 \
    --observ_compress_method truncate \
    --method DFS_woFilter_w2 \
    --input_query_file data/test_instruction/G1_instruction.json \
    --output_answer_file toolllama_lora_dfs_inference_result \
    --toolbench_key $TOOLBENCH_KEY

For ToolLLaMA-LoRA under open-domain setting, run:

export PYTHONPATH=./
python toolbench/inference/qa_pipeline_open_domain.py \
    --tool_root_dir data/toolenv/tools/ \
    --corpus_tsv_path data/retrieval/G1/corpus.tsv \
    --retrieval_model_path /path/to/your/retrival_model \
    --retrieved_api_nums 5 \
    --backbone_model toolllama \
    --model_path huggyllama/llama-7b \
    --lora \
    --lora_path /path/to/your/toolllama_lora \
    --max_observation_length 1024 \
    --observ_compress_method truncate \
    --method DFS_woFilter_w2 \
    --input_query_file data/test_instruction/G1_instruction.json \
    --output_answer_file toolllama_lora_dfs_open_domain_inference_result \
    --toolbench_key $TOOLBENCH_KEY
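
In the open-domain setting, the retriever selects `--retrieved_api_nums` candidate APIs for each query. The sketch below illustrates the general idea of top-k retrieval by cosine similarity over embedding vectors. It is an illustration under assumptions, not the ToolBench retriever: the three-dimensional vectors and API names are made up, and a real setup would obtain embeddings from the trained retrieval model.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_apis(query_vec: Sequence[float], api_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the names of the k APIs whose vectors are most similar to the query."""
    ranked = sorted(api_vecs, key=lambda name: cosine(query_vec, api_vecs[name]), reverse=True)
    return ranked[:k]


# Toy corpus with hypothetical API names and hand-written vectors.
toy_corpus = {
    "weather_forecast": [0.9, 0.1, 0.0],
    "stock_quote": [0.0, 0.2, 0.9],
    "city_lookup": [0.7, 0.6, 0.1],
}
print(top_k_apis([1.0, 0.2, 0.0], toy_corpus, k=2))
```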

For OpenAI Models

To use ChatGPT, run:

export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model chatgpt_function \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/test_instruction/G1_instruction.json \
    --output_answer_file chatgpt_dfs_inference_result \
    --toolbench_key $TOOLBENCH_KEY

To use Text-Davinci-003, run:

export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model davinci \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/test_instruction/G1_instruction.json \
    --output_answer_file davinci_dfs_inference_result \
    --toolbench_key $TOOLBENCH_KEY

Inference With Your Own RapidAPI Account

To run inference with your own RapidAPI account, pass your RapidAPI key through rapidapi_key and specify the use_rapidapi_key argument in the script:

export RAPIDAPI_KEY=""
export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model chatgpt_function \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/test_instruction/G1_instruction.json \
    --output_answer_file chatgpt_dfs_inference_result \
    --rapidapi_key $RAPIDAPI_KEY \
    --use_rapidapi_key

API Customization

To run inference with customized API(s), prepare the API documentation and code, then modify your query. For example, to add an API hello_world which returns a "hello world" string:

{
    "tool_description": "Return hello world.",
    "tool_name": "hello world",
    "title": "hello world",
    "api_list": [
        {
            "name": "get_hello_world",
            "url": "",
            "description": "To get 'hello world'.",
            "method": "GET",
            "required_parameters": [],
            "optional_parameters": []
        }
    ],
    "standardized_name": "hello_world"
}

Then put it under a specific category in data/toolenv/tools/, either one of the 49 existing categories or a new category, e.g. Customized.

def get_hello_world():
    """To get hello world"""
    observation = "hello world"
    return observation

Now the file structure under data/toolenv/ should be:

├── /tools/
│  ├── /Sports/
│  │  ├── basketball.json
│  │  ├── /basketball/
│  │  │  └── api.py
│  │  └── ...
│  ├── ...
│  ├── /Customized/
│  │  ├── hello_world.json
│  │  ├── /hello_world/
│  │  │  └── api.py
└── response_examples

Then modify your query file so that it references the new API:

[
    {
        "query": "I want to get a 'hello world' string.",
        "query_id": 200001,
        "api_list": [
            {
                "category_name": "Customized",
                "tool_name": "hello world",
                "api_name": "get_hello_world"
            }
        ]
    }
]

export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model toolllama \
    --model_path ToolBench/ToolLLaMA-7b \
    --max_observation_length 1024 \
    --observ_compress_method truncate \
    --method DFS_woFilter_w2 \
    --input_query_file /path/to/your/query/file \
    --output_answer_file /path/to/your/output/file

Currently we only support customized API usage under the closed-domain setting. We plan to support the open-domain setting soon.
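
The file-creation steps above can also be scripted. The following sketch writes the hello_world documentation and code into the layout shown earlier. The function name `scaffold_hello_world` is hypothetical, and the JSON and Python contents simply copy the example in this section.

```python
import json
from pathlib import Path

# API documentation, copied from the hello_world example above.
TOOL_DOC = {
    "tool_description": "Return hello world.",
    "tool_name": "hello world",
    "title": "hello world",
    "api_list": [
        {
            "name": "get_hello_world",
            "url": "",
            "description": "To get 'hello world'.",
            "method": "GET",
            "required_parameters": [],
            "optional_parameters": [],
        }
    ],
    "standardized_name": "hello_world",
}

# API code, copied from the hello_world example above.
API_CODE = '''def get_hello_world():
    """To get hello world"""
    observation = "hello world"
    return observation
'''


def scaffold_hello_world(tools_root: str) -> Path:
    """Create Customized/hello_world.json and Customized/hello_world/api.py under `tools_root`."""
    category_dir = Path(tools_root) / "Customized"
    api_dir = category_dir / "hello_world"
    api_dir.mkdir(parents=True, exist_ok=True)
    (category_dir / "hello_world.json").write_text(json.dumps(TOOL_DOC, indent=4), encoding="utf-8")
    (api_dir / "api.py").write_text(API_CODE, encoding="utf-8")
    return category_dir
```

Calling `scaffold_hello_world("data/toolenv/tools")` from the ToolBench/ folder would place the files under the Customized category.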

Setting up and running the interface

ToolBench contains a Web UI based on Chatbot UI, forked to include the use of tools in the interface. It comes in two parts: the backend server, and chatbot-ui-toolllama. Here is a video demo.

Web UI

git clone https://github.com/lilbillybiscuit/chatbot-ui-toolllama
cd chatbot-ui-toolllama
npm install
npm run dev

The app will be available on http://localhost:3000/

Backend server

export PYTHONPATH=./
python toolbench/inference/toolbench_server.py \
    --tool_root_dir data/toolenv/tools/ \
    --corpus_tsv_path data/retrieval/G1/corpus.tsv \
    --retrieval_model_path /path/to/your/retrival_model \
    --retrieved_api_nums 5 \
    --backbone_model toolllama \
    --model_path huggyllama/llama-7b \
    --lora \
    --lora_path /path/to/your/toolllama_lora \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/test_instruction/G1_instruction.json \
    --output_answer_file toolllama_lora_dfs_open_domain_result \
    --rapidapi_key $RAPIDAPI_KEY

This server will be available on http://localhost:5000/. To start a request, call http://localhost:5000/stream with a GET or POST request containing a JSON object with the following fields:

{
    "text": "What is the weather in New York today?",
    "top_k": 5,
    "method": "DFS_woFilter_w2"
}

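For reference, here is one way to build such a request from Python using only the standard library. It is a sketch under assumptions: the server is assumed to accept the fields as a JSON body with a `Content-Type: application/json` header, the helper name `build_stream_request` is hypothetical, and the format of the streamed response is not covered here.

```python
import json
import urllib.request


def build_stream_request(
    text: str,
    top_k: int = 5,
    method: str = "DFS_woFilter_w2",
    base_url: str = "http://localhost:5000",
) -> urllib.request.Request:
    """Build a POST request for the /stream endpoint without sending it."""
    payload = {"text": text, "top_k": top_k, "method": method}
    return urllib.request.Request(
        url=f"{base_url}/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_stream_request("What is the weather in New York today?")

# With the backend server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     for line in response:
#         print(line.decode("utf-8"), end="")
```
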

By fine-tuning LLaMA on ToolBench, we obtain ToolLLaMA. Considering that human evaluation can be time-consuming, we follow AlpacaEval to develop an efficient machine evaluator ToolEval, which incorporates two evaluation metrics:

To validate the reliability of the ChatGPT evaluator in both pass rate and win rate, we sample among four different methods (ChatGPT+ReACT, ChatGPT+DFSDT, ToolLLaMA+DFSDT and GPT4+DFSDT) to obtain solution pairs for 300 test instructions for each method. Then we engage humans to annotate the pass rate for ChatGPT+DFSDT, ToolLLaMA+DFSDT and GPT4+DFSDT, and the win rate between ChatGPT+ReACT and ChatGPT+DFSDT. Our ChatGPT evaluator demonstrates a high agreement of 87.1% in pass rate and 80.3% in win rate with human annotators. This result shows that our evaluator generates evaluation results highly similar to those of humans and can be viewed as a credible evaluator that simulates human evaluation on pass rate and win rate.
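
To illustrate what an agreement figure of this kind measures, the sketch below computes the fraction of items on which two sets of labels coincide. It assumes agreement is simple label matching, which may differ from the exact procedure in the paper. The labels are toy data, not results from ToolBench.

```python
from typing import Sequence


def agreement_rate(evaluator_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items on which the evaluator and the human annotator give the same label."""
    if len(evaluator_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not evaluator_labels:
        raise ValueError("label sequences must not be empty")
    matches = sum(e == h for e, h in zip(evaluator_labels, human_labels))
    return matches / len(evaluator_labels)


# Toy labels (hypothetical): the two annotators disagree on the third item only.
toy_evaluator = ["pass", "fail", "pass", "pass"]
toy_human = ["pass", "fail", "fail", "pass"]
print(agreement_rate(toy_evaluator, toy_human))  # 0.75
```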

More details about ToolEval can be found in our paper.

Evaluation with ToolEval


Install Package (python>=3.9)

pip install -r requirements.txt


If you want to reproduce the official results, download the reproduction data reproduction_data.zip through Google Drive, unzip it and put the reproduction_data under ToolBench/data/, and skip the data preparation process.

├── /chatgpt_cot/
│  ├── /G1_instruction/
│  │  ├── /10160_CoT@1.json
│  │  └── ...
│  ├── /G1_tool/
│  │  ├── /10221_CoT@1.json
│  │  └── ...
│  ├── ...
│  ├── /G3_instruction/
│  │  ├── /10221_CoT@1.json
│  │  └── ...

Then preprocess the predictions by running the following commands:

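export RAW_ANSWER_PATH=../../data/reproduction_data/model_predictions/
export CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/
export MODEL_NAME=chatgpt_cot
export METHOD=CoT
# Create the output directory for the converted answers
mkdir -p ${CONVERTED_ANSWER_PATH}/${MODEL_NAME}
for test_set in G1_instruction G1_category G1_tool G2_category G2_instruction G3_instruction
do
    # Assumed layout: <path>/<model name>/<test set> for input, <test set>.json for output
    answer_dir=${RAW_ANSWER_PATH}/${MODEL_NAME}/${test_set}
    output_file=${CONVERTED_ANSWER_PATH}/${MODEL_NAME}/${test_set}.json
    python convert_to_answer_format.py \
        --answer_dir ${answer_dir} \
        --method ${METHOD} \
        --output ${output_file}
done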

After that, check if there are preprocessed json files for the test sets under ${CONVERTED_ANSWER_PATH}/${MODEL_NAME}. If so, you're ready to run the following evaluate process. If not, check if there is anything wrong with the model's predictions.
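
This check can be automated with a few lines of Python. The sketch below assumes each test set is converted into a file named `<test_set>.json` inside `${CONVERTED_ANSWER_PATH}/${MODEL_NAME}`. Both that naming and the helper name `missing_converted_files` are assumptions rather than documented behavior.

```python
from pathlib import Path
from typing import List

# Test sets taken from the conversion loop above.
TEST_SETS = [
    "G1_instruction",
    "G1_category",
    "G1_tool",
    "G2_category",
    "G2_instruction",
    "G3_instruction",
]


def missing_converted_files(converted_answer_path: str, model_name: str) -> List[str]:
    """Return the test sets that have no converted JSON file for `model_name`."""
    model_dir = Path(converted_answer_path) / model_name
    return [name for name in TEST_SETS if not (model_dir / f"{name}.json").is_file()]
```

An empty list from `missing_converted_files("../../data/reproduction_data/model_predictions_converted/", "chatgpt_cot")` would indicate that every test set was converted.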

[
    {
        "username": "your_user_name",
        "passwd": "your_password",
        "api_key": "your_openai_key",
        "organization": "your_organization"
    }
]

export CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/
export SAVE_PATH=pass_rate_results
export CANDIDATE_MODEL=chatgpt_cot
export API_POOL_FILE=path/to/your/openai_key_json_file.json

python eval_pass_rate.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --save_path ${SAVE_PATH} \
    --reference_model ${CANDIDATE_MODEL} \
    --test_ids ../../data/test_ids/ \
    --max_eval_threads 20 \
    --evaluate_times 7

The result files will be stored under the ${SAVE_PATH}.

export CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/
export SAVE_PATH=preference_results
export PASS_RATE_PATH=pass_rate_results
export REFERENCE_MODEL=chatgpt_cot
export CANDIDATE_MODEL=gpt-4-0613_cot
export API_POOL_FILE=path/to/your/openai_key_json_file.json

python eval_preference.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --reference_model ${REFERENCE_MODEL} \
    --output_model ${CANDIDATE_MODEL} \
    --test_ids ../../data/test_ids/ \
    --save_path ${SAVE_PATH} \
    --pass_rate_result_path ${PASS_RATE_PATH} \
    --max_eval_threads 20 \
    --use_pass_rate true \
    --evaluate_times 7

The result files will be stored under the ${SAVE_PATH}.

Please refer to ToolEval for more details.

📊 Model Experiment Results

In our main experiments, ToolLLaMA(v2) demonstrates a compelling capability to handle both single-tool and complex multi-tool instructions, which is on a par with ChatGPT. Below are the main results. The win rate for each model is computed against ChatGPT-ReACT.

Pass Rate:


Win Rate: (Reference model: ChatGPT-ReACT)



Resources of Tool Learning

With the powerful capabilities of foundation models, we are eager to see their applications in manipulating various tools. For more resources, please refer to the following:


Feel free to cite us if you like ToolBench.

      title={ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs},
      author={Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and Dahai Li and Zhiyuan Liu and Maosong Sun},

      title={Tool Learning with Foundation Models},
      author={Yujia Qin and Shengding Hu and Yankai Lin and Weize Chen and Ning Ding and Ganqu Cui and Zheni Zeng and Yufei Huang and Chaojun Xiao and Chi Han and Yi Ren Fung and Yusheng Su and Huadong Wang and Cheng Qian and Runchu Tian and Kunlun Zhu and Shihao Liang and Xingyu Shen and Bokai Xu and Zhen Zhang and Yining Ye and Bowen Li and Ziwei Tang and Jing Yi and Yuzhang Zhu and Zhenning Dai and Lan Yan and Xin Cong and Yaxi Lu and Weilin Zhao and Yuxiang Huang and Junxi Yan and Xu Han and Xian Sun and Dahai Li and Jason Phang and Cheng Yang and Tongshuang Wu and Heng Ji and Zhiyuan Liu and Maosong Sun},

      title={StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models},
      author={Guo, Zhicheng and Cheng, Sijie and Wang, Hao and Liang, Shihao and Qin, Yujia and Li, Peng and Liu, Zhiyuan and Sun, Maosong and Liu, Yang},