Home

Awesome

BIRD-SQL Mini-Dev

<p align="center"> <img src="materials/bird_circle_main.png" style="width: 30%; min-width: 100px; display: block; margin: auto;"> </p> <p align="center" width="100%"> <a href="https://arxiv.org/abs/2305.03111">🔗Paper</a> <a href="https://bird-bench.github.io/">🏆Leaderboard</a> <p>

License Data Link Python 3.11+ Leaderboard 1.8+ OpenAI 1.30+ SQLite 3.41+ MySQL 8.40+ PostgreSQL 14.12+

<p align="center" width="100%"> <a><img src="materials/intro.png" style="width: 100%; min-width: 300px; display: block; margin: auto;"></a> </p>

Overview

Here, we provide a Lite version of developtment dataset: Mini-Dev. This mini-dev dataset is designed to facilitate efficient and cost-effective development cycles, especially for testing and refining SQL query generation models. This dataset results from community feedback, leading to the compilation of 500 high-quality text2sql pairs derived from 11 distinct databases in a development environment. To further enhance the practicality of the BIRD system in industry settings and support the development of text-to-SQL models, we make the Mini-Dev dataset available in both MySQL and PostgreSQL.

Additionally, we introduce two new evaluation metrics for the Mini-Dev dataset: the Reward-based Valid Efficiency Score (R-VES) and the Soft F1-Score. These metrics aim to evaluate the efficiency and accuracy of text-to-SQL models, respectively. It is important to note that the both metrics, currently in their beta version, applies exclusively to the Mini-Dev dataset using baseline models.

We welcome contributions and suggestions for enhancing these metrics, particularly regarding their integration into existing leaderboards. Please do not hesitate to contact us if you are interested in these developments or have any proposals for improvements.

Below are some key statistics of the mini-dev dataset:

Difficulty Distribution

Database Distribution

Keywords Statistic

Dataset Introduction

The dataset contains the main following resources:

[!NOTE] You have to download the latest dev databases in order to construct database in the MySQL and PostgreSQL. If you use the SQLite version only, you can use the original dev databases.

Mini-Dev Dataset in MySQL and PostgreSQL

You can locate the SQL queries within the mini_dev_mysql.json and mini_dev_postgresql.json files. These queries have been transpiled from the original SQLite versions using the sqlglot package, then refined manually and with GPT-4 Turbo. After downloading the Mini-Dev dataset, each database folder will contain .sql and command.script files. Follow the instructions below to set up the database in MySQL and PostgreSQL:

MySQL

  1. Download and install the MySQL from the official website: https://dev.mysql.com/downloads/mysql/
  2. Set the environment variables:
export PATH=$PATH:/usr/local/mysql/bin
  1. Start the MySQL server:
sudo /usr/local/mysql/support-files/mysql.server start
  1. Login to the MySQL server and create the database (password will be the one you set during the installation)
mysql -u root -p
CREATE DATABASE BIRD;
  1. Construct the database by run the following command (You can find MySQL version database: BIRD_dev.sql in the MINIDEV_mysql folder):
mysql -u root -p BIRD < BIRD_dev.sql
  1. Examples that how to run mysql query in the Python (with pymysql) can be find in the examples/mysql_example.ipynb file.

  2. If you encounter the error: "this is incompatible with sql_mode=only_full_group_by", you can run the following command to disable the sql_mode:

select @@global.sql_mode;
SET GLOBAL sql_mode='{EVERYTHING SHOW IN THE ABOVE COMMAND EXCEPT ONLY_FULL_GROUP_BY}';

PostgreSQL

  1. Download and install the postgresql from the official website: https://www.postgresql.org/download/
  2. Download the pgAdmin4 from the official website: https://www.pgadmin.org/download/ (Recommended to monitor the database)
  3. In pgADmin4/terminal create a new database called BIRD
  4. Construct the database by run the following command (You can find PostgreSQL version database:BIRD_dev.sql in the MINIDEV_postgresql folder):
psql -U USERNAME -d BIRD -f BIRD_dev.sql
  1. Examples that how to run mysql query in the Python (with Psycopg) can be find in the examples/postgresql_example.ipynb file.

In-Context Learning (ICL):

Environment Setup:

First, you need install openai in your python environment by:

conda create -n BIRD python=3.11.5
pip install requirements.txt

Collect results

Use this script to run the OpenAI model on the Azure cloud. (you may need to adjust parameters and paths with your preference):

cd ./llm/
sh ./run/run_gpt.sh

Evaluation:

Execution (EX) Evaluation:

Please post-process your collected results as the format: SQL and its db_id, which is splitted by '\t----- bird -----\t'. The examples are shown in the ./llm/exp_result/turbo_output/predict_mini_dev_gpt-4-turbo_cot_SQLite.json. Put the ground-truth sql file in the ./data/. And you may need to design a ChatGPT tag by your own. The main file for ex evaluation is located at ./llm/src/evaluation_ex.py.
Then you could evaluate the results by the following command line :

cd ./llm/
sh ./run/run_evaluation.sh

Reward-based Valid Efficiency Score (R-VES):

The main file for R-VES evaluation is located at ./llm/src/evaluation_ves.py. R-VES and EX can be evaluated in the same shell, so you can eval your efficiency via:

cd ./llm/
sh ./run/run_evaluation.sh

(For stable R-VES, you may need to enlarge timeout or repeat and average results. In our test evaluation, we will enlarge timeout to 3 s/ex; then we repeat 5 times for VES computation, only the highest results will be reported.)

In the latest version, we adjust the VES evaluation to be more stable and reliable. Instead of simply measuring the time ratio between predict and ground-truth SQLs, we now assign reward point based on the time ratio. The R-VES are calculated as follows:

<p align="center" width="100%"> <a><img src="materials/time_ratio_formula.png" style="width: 70%; min-width: 300px; display: block; margin: auto;"></a> </p>

Soft F1-Score Evaluation:

The main file for Soft F1-Score evaluation is located at ./llm/src/evaluation_f1.py. Soft-F1, VES and EX can be evaluated in the same shell, so you can eval your efficiency via:

cd ./llm/
sh ./run/run_evaluation.sh

Soft F1-Score:

Alongside the update to the Mini-Dev set, we introduced a new evaluation metric—the soft F1-score. This metric is specifically designed to assess the performance of text-to-SQL models by measuring the similarity between the tables produced by predicted SQL queries and those from the ground truth. In a nutshell, the soft F1-score is a more lenient metric that reduces the impact of column order and missing values in the tables produced by predicted SQL queries.

The following demonstrate how we calculate the soft F1-score.

Ground truth SQL resulted table:

Row
1'Apple'325
2'Orange'
3'Banana'119

Predicted SQL resulted table:

Row
1325'Apple'
2191'Orange'
3'Banana'

The soft F1-score is calculated as follows:

MatchedPred_onlyGold_only
Row 1200
Row 2110
Row 3101

Baseline performance on Mini-Dev Dataset

EX Evaluation

SQLiteMySQLPostgreSQL
mixtral-8x7b21.6013.6012.40
llama3-8b-instruct24.4024.6018.40
phi-3-medium-128k-instruct30.6025.0021.60
gpt-35-turbo-instruct33.6031.2026.60
gpt-35-turbo38.0036.0027.40
llama3-70b-instruct40.8037.0029.40
TA + gpt-35-turbo41.60--
TA + llama3-70b-instruct42.80--
gpt-4-turbo45.8041.0036.00
gpt-4-32k47.0043.2035.00
gpt-447.8040.8035.80
TA + gpt-4-turbo58.00--
TA + gpt-4o63.00--

R-VES Evaluation

SQLiteMySQLPostgreSQL
mixtral-8x7b20.4112.9914.16
llama3-8b-instruct23.2723.6617.90
phi-3-medium-128k-instruct29.5424.1221.07
gpt-35-turbo-instruct32.2830.3926.14
gpt-35-turbo37.3334.9426.80
llama3-70b-instruct39.0235.8228.80
TA + gpt-35-turbo40.59--
TA + llama3-70b-instruct41.37--
gpt-4-turbo44.7939.3735.23
gpt-4-32k45.2942.7934.59
gpt-445.9139.9235.24
TA + gpt-4-turbo56.44--
TA + gpt-4o60.86--

Soft F1-Score Evaluation

SQLiteMySQLPostgreSQL
mixtral-8x7b22.9513.7914.70
llama3-8b-instruct27.8727.4919.35
phi-3-medium-128k-instruct35.3328.7324.11
gpt-35-turbo-instruct36.3433.8528.30
gpt-35-turbo41.8440.7530.22
TA + gpt-35-turbo44.25--
llama3-70b-instruct44.3840.9531.43
TA + llama3-70b-instruct46.66--
gpt-4-turbo50.0845.9638.36
gpt-4-32k51.9247.3839.55
gpt-452.6945.7838.96
TA + gpt-4-turbo62.40--
TA + gpt-4o66.97--

Predict SQLs

We drop the predicted SQLs of baseline models under ./llm/exp_result/sql_output_kg/ for reference.

Time Ratio Distribution

<p align="center" width="100%"> <a><img src="materials/time_ratio_sqlite.png" style="width: 100%; min-width: 300px; display: block; margin: auto;"></a> </p>

Acknowledgement

Main contributors to the Mini-Dev project: Xiaolong Li, Jinyang Li, Ge Qu, Binyuan Hui, Reynold Cheng, Chenhao Ma.

We extend our sincere gratitude to the invaluable feedbacks from the open community, including github reviewers (@freiz @nnarodytska @josem7 @wbbeyourself @ronch99 @tshu-w ) and those who reached out to us via email with their valuable suggestions.

For any questions, please contact us by bird.bench23@gmail.com.

My To-Do List

Citation

Please cite the repo if you think our work is helpful to you.

@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}