ProTrix
Code and data for ProTrix: Building Models for Planning and Reasoning over Tables with Sentence Context
Introduction
We propose a Plan-then-Reason framework to answer user queries on tabular tasks with sentence context. The framework first plans the reasoning pathways by ingesting the query and the context, then assigns each step to textual or program-based reasoning to arrive at the final answer. We construct TrixInstruct, an instruction-tuning set for building models with generalizability and interpretability over tables with sentence context. Using it, we develop ProTrix, a model capable of planning and reasoning on tabular tasks with sentence context. ProTrix can generalize to unseen tasks and generate accurate and faithful explanations.
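At a high level, the framework plans first and then routes each step to the appropriate form of reasoning. The Python sketch below is purely illustrative of this idea: every name, prompt, and tag in it is hypothetical, and it is not the implementation in src/plan_then_reason.py.

from typing import Callable, List

def plan_then_reason(query: str,
                     table: str,
                     sentences: str,
                     call_model: Callable[[str], str],
                     run_program: Callable[[str], str]) -> str:
    # 1. Plan: ask the model to decompose the query into steps, each tagged as
    #    textual reasoning or program-based (e.g. SQL) reasoning over the table.
    plan = call_model(
        f"Plan the reasoning steps for the query.\nQuery: {query}\n"
        f"Table:\n{table}\nContext:\n{sentences}\n"
        "Tag each step as TEXT or PROGRAM."
    )

    # 2. Reason: execute each planned step with the corresponding tool.
    results: List[str] = []
    for line in plan.splitlines():
        if line.startswith("PROGRAM:"):
            results.append(run_program(line[len("PROGRAM:"):].strip()))
        elif line.startswith("TEXT:"):
            results.append(call_model(line[len("TEXT:"):].strip()))

    # 3. Conclude: combine intermediate results into the final answer.
    return call_model(f"Intermediate results: {results}\nAnswer the query: {query}")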
Updates
[2024/10/04] Our paper is accepted by EMNLP 2024 Findings!
[2024/07/07] We update the prediction results for the ProTrix family. You can use them to reproduce our results or run your own evaluation.
[2024/07/02] We add code for in-context learning experiments with our framework. Check the details in our paper!
[2024/05/21] We release our finetuned models on huggingface. We also introduce a new model Llama-3-ProTrix based on Llama-3-8B!
Models
You can download our models ProTrix, ProTrix-Coder and Llama-3-ProTrix on huggingface. The latest Llama-3-ProTrix is finetuned using Xtuner.
Environments
conda create --name protrix --file requirements.txt
conda activate protrix
We use vllm to speed up the inference.
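For reference, a minimal vllm generation sketch might look like the following. The model path and prompt are placeholders; the scripts under evaluation/ wrap this kind of call with benchmark-specific prompts and templates.

from vllm import LLM, SamplingParams

llm = LLM(model="path/to/ProTrix")           # local path or Hugging Face repo id (placeholder)
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(["<table + question prompt here>"], params)
print(outputs[0].outputs[0].text)            # the generated response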
├── data
│ ├── evaluation_data
│ ├── gpt_output
│ └── TrixInstruct.json
├── evaluation
│ ├── compute_score.py
│ ├── evaluate_with_llm.py
│ ├── generate_all_responses.sh
│ ├── sql_tool.py
│ └── generate_with_sql.py
├── prompt
│ ├── one_step.py
│ ├── prompt.py
│ └── reason_after_sql.py
├── src
│ ├── utils
│ ├── eval_icl.py
│ ├── plan_then_reason.py
│ └── parse_output.py
├── README.md
└── requirement.txt
Finetuning with Plan-then-Reason Framework
Data Format of TrixInstruct
You can also download our dataset from Huggingface.
{
'id': instance id from the original dataset,
'instruction': instruction for generating the response,
'output': response obtained from GPT-4,
'answer': gold answer from the original dataset
}
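As a quick sanity check, you can load and inspect the instances with the standard json module. This sketch assumes the file stores a list of records with the fields above and sits at data/TrixInstruct.json as in the repository layout.

import json

with open("data/TrixInstruct.json") as f:
    instances = json.load(f)

example = instances[0]
print(example["id"])           # instance id from the original dataset
print(example["instruction"])  # instruction used to generate the response
print(example["output"])       # GPT-4 response used as the training target
print(example["answer"])       # gold answer from the original dataset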
Inference
You can run the following command to generate results for a specific benchmark. We split the data across GPUs with multiprocessing to speed up the process. The results will be saved at data/outputs/{model_name}. Use the empty template for ProTrix and ProTrix-Coder, and the llama3 template for Llama-3-ProTrix.
CUDA_VISIBLE_DEVICES={} python evaluation/generate_with_sql.py --model_path {your_path_to_protrix} --benchmark {benchmark} --template {template}
Or you can run the following command to generate responses for all the benchmarks in data/evaluation_data:
cd evaluation
sh generate_all_responses.sh {protrix_path} {device}
This script also produces an approximation of the evaluation results. Note that the template is set to llama3 by default; change it to empty for ProTrix and ProTrix-Coder.
Evaluation
Since our models are not trained to follow the exact answer format or rules of each dataset, we provide two methods to compute the final score.
Heuristic Matching Method
We try to match the answer in the concluding sentence of the response. This method is not perfect, but based on our human evaluation only a tiny proportion of answers are mismatched. We used this method while developing our model, and we use it to report the final results on fact verification tasks in our paper.
python evaluation/compute_score.py --benchmark {benchmark} --model_path {model_path}
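For intuition, a simplified version of this matching idea is sketched below. It is illustrative only; the actual logic in evaluation/compute_score.py may differ.

import re

def normalize(text: str) -> str:
    # Lowercase and strip punctuation for a lenient comparison.
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).strip()

def heuristic_match(response: str, gold_answer: str) -> bool:
    # Take the last non-empty sentence/line of the response as the conclusion
    # and check whether the gold answer appears in it.
    parts = [p for p in re.split(r"[\n.]", response) if p.strip()]
    conclusion = parts[-1] if parts else response
    return normalize(gold_answer) in normalize(conclusion)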
LLM Method
We employ Llama-2-70B-chat to assess the correctness of the final answer. We use this method to report the final results on question answering tasks in our paper.
python evaluate_with_llm.py --benchmark {benchmark} --model_path {model_path} --llm_path {llm_path}
python compute_score.py --benchmark {benchmark} --model_path {model_path} --eval_with_llm
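Conceptually, the LLM-based check asks the judge model whether the predicted answer agrees with the gold answer. The sketch below is illustrative only; the actual prompt and parsing live in evaluation/evaluate_with_llm.py, and generate stands in for a call to Llama-2-70B-chat (e.g. served with vllm).

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "Model answer: {prediction}\n"
    "Does the model answer agree with the gold answer? Reply yes or no."
)

def judge(question: str, gold: str, prediction: str, generate) -> bool:
    # `generate` is any callable that maps a prompt string to the judge model's reply.
    reply = generate(JUDGE_TEMPLATE.format(question=question, gold=gold, prediction=prediction))
    return reply.strip().lower().startswith("yes")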
In-Context Learning Method
You can run the following commands to get responses from the OpenAI API. To make a fair comparison, we use the normalized data from the previous SOTA, TabSQLify.
python src/plan_then_reason.py --model_name {model_name} --benchmark {benchmark} --step one_step --run
You can replace --run with --dry_run to test your prompt. You can add the argument --end $num to test the code on the first $num instances.
python src/parse_output.py --model_name {model_name} --benchmark {benchmark} --step one_step
python src/plan_then_reason.py --model_name {model_name} --benchmark {benchmark} --step one_step_result --run
python src/parse_output.py --model_name {model_name} --benchmark {benchmark} --step one_step_result
Now you can run the following command to evaluate the in-context learning results. We use the same evaluator as Binder to make a fair comparison with existing prompting methods.
python src/eval_icl.py --model_name {model_name} --benchmark {benchmark} --one_step
Citation
@inproceedings{wu-feng-2024-protrix,
title = "{P}ro{T}rix: Building Models for Planning and Reasoning over Tables with Sentence Context",
author = "Wu, Zirui and
Feng, Yansong",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.253",
pages = "4378--4406"
}
License
Check out the license for our models. Our curated dataset is under the MIT license.
Acknowledgement
Our implementation for in-context learning is based on Binder and TabSQLify. We thank Rami Aly and Md Mahadi Hasan Nahid for helpful discussions.
Contact
If you have any questions or want to discuss future research directions, feel free to email ziruiwu@pku.edu.cn.