Awesome
ZeroNL2SQL
:thought_balloon: Introduction
This repository contains the code for our VLDB2024 paper “Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL”.
:open_file_folder: Data Preparation
Train data
- Spider: Put it under
src/datasets/spider
.
Test data
- KaggleDBQA: Put it under
src/datasets/kaggledbqa
. - Dr.Spider: Put it under
src/datasets/drspider
.
mkdir data/
unzip src/datasets/kaggledbqa/kaggledbqa.zip -d data/
unzip src/datasets/drspider/drspider.zip -d data/
# Don't delete the original .zip file
:computer: Environment Preparation
Please refer to requirements.txt
to download the relevant toolkits.
Prepare the following folders:
cd ZeroNL2SQL
mkdir logs
mkdir experimental_outputs/train/template_generator
mkdir experimental_outputs/train/aligner
:zap: Quick Start
Download models
- Template Generator: Put it under
experimental_outputs/train/template_generator
. - Aligner: Put it under
experimental_outputs/train/aligner
.
Text-to-SQL inference
Use the following script to directly infer on the text-to-sql test set. This script will take four steps: 1. generate SQL template; 2. align (SELECT, STRUCTURE) with the user question; 3. prepare data for LLM inference; 4. text2sql using LLM.
CUDA_VISIBLE_DEVICES={gpu_id} bash scripts/infer_LLM_with_template.sh test_set_name your_openai_key
- The first argument is the name of the test set, which can be selected from
kaggledbqa
,DB_DBcontent_equivalence
,DB_schema_abbreviation
,DB_schema_synonym
,NLQ_keyword_synonym
,NLQ_keyword_carrier
,NLQ_column_synonym
,NLQ_column_carrier
,NLQ_column_attribute
,NLQ_column_value
,NLQ_value_synonym
,NLQ_multitype
,NLQ_others
,SQL_comparison
,SQL_sort_order
,SQL_NonDB_number
,SQL_DB_text
,SQL_DB_number
. - The second argument is your openai key, which you can obtain from the official website.
Note that we evaluate the text-to-SQL results using the test_suite_evaluation, and the evaluation results are presented in eval.output
.
:open_hands: Train From Scratch
Train template generator
CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run.py configs/train_template_generator.json"
The best model will be saved at experimental_outputs/train/template_generator/BEST_MODEL/
.
Train aligner
CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run_aligner.py configs/train_aligner.json"
The best model will be saved at experimental_outputs/train/aligner/checkpoint_best.pkl
.
:speech_balloon:Citation
If our code is helpful to you, please cite our work:
@misc{gu2023interleaving,
title={Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation},
author={Zihui Gu and Ju Fan and Nan Tang and Songyue Zhang and Yuxin Zhang and Zui Chen and Lei Cao and Guoliang Li and Sam Madden and Xiaoyong Du},
year={2023},
eprint={2306.08891},
archivePrefix={arXiv},
primaryClass={cs.CL}
}