Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness

Description

This repo contains Dr. Spider, a diagnostic evaluation benchmark for the robustness of text-to-SQL models. It comprises 17 perturbation test sets that measure model robustness from different angles. It is released along with our ICLR 2023 paper, Dr. Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness; see the paper for details.

The dataset is built from the dev set of the Spider dataset. Our changes are intended to supplement the work done by Spider.

Preprocessing

First, unzip the data using the following command.

mkdir data
tar -xvf data.tar.gz -C data

Run data_preprocess.py to copy the pre-perturbation databases and tables from the original Spider development set.

python data_preprocess.py

To Use

Each folder contains a perturbation test set. There are 3 DB perturbation test sets (starting with DB_), 9 NLQ perturbation test sets (starting with NLQ_), and 5 SQL perturbation test sets (starting with SQL_). Each test set contains parallel pre-perturbation and post-perturbation test data.
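As a quick sanity check after preprocessing, the folder layout described above can be counted by prefix. This is a minimal sketch, not part of the repo: it assumes the test-set folders sit directly under `data/`, and the helper names `count_test_sets` and `list_test_sets` are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Prefix -> number of test sets stated in this README (3 + 9 + 5 = 17).
EXPECTED_COUNTS = {"DB_": 3, "NLQ_": 9, "SQL_": 5}


def count_test_sets(folder_names):
    """Count folder names by perturbation prefix; other names are ignored."""
    counts = Counter()
    for name in folder_names:
        for prefix in EXPECTED_COUNTS:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
    return dict(counts)


def list_test_sets(data_root="data"):
    """Return sub-folder names of data_root (the location is an assumption)."""
    root = Path(data_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    found = count_test_sets(list_test_sets())
    for prefix, expected in EXPECTED_COUNTS.items():
        print(f"{prefix}: found {found.get(prefix, 0)}, expected {expected}")
```

If the counts do not match, the archive may have been extracted to a different directory than the one assumed here.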

First, run the model on the Spider-dev set to get the predicted SQL queries and put them in predictions/Spider-dev/[model_name]/pred.sql. Then, run the model on each post-perturbation set and put the predicted SQL queries in predictions/[perturbation_name]/[model_name]/pred.sql.
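The prediction layout above can be produced with a small helper. This is a sketch under assumptions: the functions `prediction_path` and `write_predictions` are hypothetical, and writing one query per line is an assumption about the pred.sql format rather than something this README states.

```python
from pathlib import Path


def prediction_path(test_set, model_name, root="predictions"):
    """Build predictions/[test_set]/[model_name]/pred.sql."""
    return Path(root) / test_set / model_name / "pred.sql"


def write_predictions(sql_queries, test_set, model_name, root="predictions"):
    """Write one predicted SQL query per line (assumed file format)."""
    path = prediction_path(test_set, model_name, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Collapse whitespace runs so each query stays on one line.
    # Note: this would also alter whitespace inside SQL string literals.
    lines = [" ".join(query.split()) for query in sql_queries]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For example, `write_predictions(queries, "Spider-dev", "my_model")` would write to predictions/Spider-dev/my_model/pred.sql, where `my_model` stands in for [model_name].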

To Evaluate a Model

Run copy_pre_perturbation_predictions.py to copy the SQL predictions for Spider-dev to all pre-perturbation sets. Then evaluate the model on each pre-perturbation and post-perturbation set using the test-suite evaluation.

python copy_pre_perturbation_predictions.py --model [model_name]

Leaderboard

Pre-perturbation and post-perturbation accuracy in terms of execution (EX)

The EX accuracy of models on pre-perturbation and post-perturbation data. We report the macro-average results over the perturbation test sets in each of the DB, NLQ, and SQL categories. Each entry x-y gives the accuracy on pre-perturbation data (x) and on post-perturbation data (y).
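To make the reporting format concrete, the sketch below computes a macro average in the usual sense (an unweighted mean over test sets, so each test set counts equally regardless of its size) and formats it as x-y. The function names are hypothetical and the numbers are made up for illustration; they are not results from the benchmark.

```python
def macro_average(accuracies):
    """Unweighted mean of per-test-set accuracies."""
    if not accuracies:
        raise ValueError("need at least one test set")
    return sum(accuracies) / len(accuracies)


def format_entry(pre_scores, post_scores):
    """Format macro-averaged pre/post accuracies as 'x-y'."""
    return f"{macro_average(pre_scores):.1f}-{macro_average(post_scores):.1f}"


# Illustrative per-test-set EX accuracies for one category (not real results).
pre = [80.0, 70.0, 75.0]
post = [60.0, 50.0, 55.0]
print(format_entry(pre, post))  # prints "75.0-55.0"
```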

Evaluation of Finetuned Models

| Model | Average of DB perturbation test sets | Average of NLQ perturbation test sets | Average of SQL perturbation test sets | Average of all test sets |
| --- | --- | --- | --- | --- |
| Picard | 78.9-55.0 | 76.0-65.0 | 76.3-74.0 | 76.6-65.9 |
| SmBoP | 74.7-50.0 | 76.6-58.1 | 74.7-72.2 | 75.7-60.8 |
| T5-3B LK | 73.5-47.0 | 70.4-58.9 | 71.7-69.6 | 71.3-59.9 |
| T5-3B | 69.5-42.9 | 68.2-54.9 | 70.9-69.5 | 69.2-57.1 |
| T5-large | 64.0-36.7 | 63.6-50.9 | 65.6-64.7 | 64.2-54.2 |
| RatSQL | 70.8-33.9 | 70.2-50.7 | 68.8-62.4 | 69.9-51.5 |
| T5-base | 51.1-22.8 | 50.0-32.6 | 56.9-51.8 | 54.3-40.6 |

Evaluation of In-context Learning Methods

| Model | Average of DB perturbation test sets | Average of NLQ perturbation test sets | Average of SQL perturbation test sets | Average of all test sets |
| --- | --- | --- | --- | --- |
| Codex | 72.6-60.7 | 75.3-60.8 | 74.6-73.1 | 74.6-64.4 |
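When reading the leaderboard, it can help to turn each x-y entry into an absolute drop in accuracy. The sketch below is an illustration only: `parse_entry` and `absolute_drop` are hypothetical helpers, and the drop is a derived quantity, not a metric reported in the tables above.

```python
def parse_entry(entry):
    """Split an 'x-y' leaderboard entry into (pre, post) accuracies."""
    pre, post = entry.split("-")
    return float(pre), float(post)


def absolute_drop(entry):
    """Pre-perturbation minus post-perturbation accuracy, in points."""
    pre, post = parse_entry(entry)
    return round(pre - post, 1)


# DB-perturbation entries taken from the leaderboard above.
print(absolute_drop("78.9-55.0"))  # Picard: 23.9
print(absolute_drop("72.6-60.7"))  # Codex: 11.9
```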

Citation and Contact

If you use the dataset in your work, please cite our paper and the Spider paper.

@article{chang2023dr,
  title={Dr. Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness},
  author={Chang, Shuaichen and Wang, Jun and Dong, Mingwen and Pan, Lin and Zhu, Henghui and Li, Alexander Hanbo and Lan, Wuwei and Zhang, Sheng and Jiang, Jiarong and Lilien, Joseph and others},
  journal={arXiv preprint arXiv:2301.08881},
  year={2023}
}

@inproceedings{yu2018spider,
  title={Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author={Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={3911--3921},
  year={2018}
}

Please contact Shuaichen Chang (chang.1692[at]osu.edu) for questions and suggestions.

Acknowledgement

We thank the authors of Spider for allowing us to redistribute the data in the Spider development set.