# SC-Prompt

## Introduction
This repository contains the code for the paper "Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning". In this paper, we propose SC-Prompt, a novel divide-and-conquer strategy for effectively supporting Text-to-SQL translation in the few-shot scenario.
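At a high level, the divide-and-conquer strategy splits translation into two stages: first predict the *structure* of the SQL query (a skeleton with typed placeholders), then predict the *content* that fills those placeholders. The toy sketch below is purely illustrative (the function name, placeholder format, and example values are not from the paper's code); it only shows how a structure-stage output and a content-stage output combine into a final query:

```python
# Toy illustration of the structure/content split (not the paper's code).
# Stage 1 predicts a SQL skeleton with placeholder slots; stage 2 predicts
# the schema items and values bound to each slot.

def fill_structure(structure: str, content: dict) -> str:
    """Substitute predicted content into a predicted SQL skeleton."""
    sql = structure
    for slot, value in content.items():
        sql = sql.replace(slot, value)
    return sql

# Hypothetical stage-1 output: the query's structure.
structure = "select [col0] from [tab0] where [col1] = [val0]"
# Hypothetical stage-2 output: the content for each slot.
content = {
    "[col0]": "name",
    "[tab0]": "singer",
    "[col1]": "country",
    "[val0]": "'USA'",
}

print(fill_structure(structure, content))
# select name from singer where country = 'USA'
```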
## Setup

```shell
git clone git@github.com:ruc-datalab/SC-prompt.git
cd SC-prompt
mkdir -p -m 777 experimental_outputs
mkdir -p -m 777 transformers_cache
cd experimental_outputs
mkdir -p -m 777 spider
mkdir -p -m 777 cosql
mkdir -p -m 777 geoquery
cd ..
```
## Dataset Download

- Spider: put it under `src/datasets/spider`.
- CoSQL: put it under `src/datasets/cosql`.
- Geoquery: put it under `src/datasets/geoquery`.
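After downloading, a quick sanity check like the following (not part of the repository's scripts) confirms that each dataset landed in the expected folder:

```shell
# Verify the dataset layout described above; run from the repository root.
for d in src/datasets/spider src/datasets/cosql src/datasets/geoquery; do
  if [ -d "$d" ]; then
    echo "found: $d"
  else
    echo "missing: $d" >&2
  fi
done
```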
## Code Structure

```
|-- experimental_outputs  # saves the fine-tuned models and evaluation results
|-- scripts               # the train/inference scripts
|-- src
    |-- datasets          # classes to preprocess the datasets
    |-- metrics           # classes to evaluate the prediction results
    |-- utils             # main code
    |-- run.py            # the entry point to train and run inference with the few-shot text-to-SQL model
```
## Environment

Our constrained decoding method is based on the parser provided by Picard. Please use the Docker image provided by the official Picard repository to build the container:

```shell
docker run -itd --gpus '"device=<your_available_gpu_ids>"' --rm --user 13011:13011 --mount type=bind,source=<your_base_dir>/transformers_cache,target=/transformers_cache --mount type=bind,source=<your_base_dir>/scripts,target=/app/scripts --mount type=bind,source=<your_base_dir>/experimental_outputs,target=/app/experimental_outputs --mount type=bind,source=<your_base_dir>/src,target=/app/src tscholak/text-to-sql-eval:6a252386bed6d4233f0f13f4562d8ae8608e7445
```

Replace `<your_available_gpu_ids>` with the ids of your available GPUs and `<your_base_dir>` with the absolute path of this repository.
## Quick Inference

Download the fine-tuned model and put it under the corresponding folder.

| Dataset  | #Train     | Model | Folder                           |
|----------|------------|-------|----------------------------------|
| Spider   | 0.05 (350) | link  | `experimental_outputs/spider/`   |
| Spider   | 0.1 (700)  | link  | `experimental_outputs/spider/`   |
| CoSQL    | 0.05 (475) | link  | `experimental_outputs/cosql/`    |
| CoSQL    | 0.1 (950)  | link  | `experimental_outputs/cosql/`    |
| Geoquery | 1.0 (536)  | link  | `experimental_outputs/geoquery/` |
Use the scripts to run inference:

```shell
# Inference on Spider
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_spider_scprompt.sh 0.1
# Inference on CoSQL
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_cosql_scprompt.sh 0.1
# Inference on Geoquery
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_geoquery_scprompt.sh 1.
```

- The numeric argument specifies the proportion of the official training set to use.
## Train from Scratch

```shell
# Train on Spider
CUDA_VISIBLE_DEVICES=0 bash scripts/train_spider_scprompt.sh 0.1
# Train on CoSQL
CUDA_VISIBLE_DEVICES=0 bash scripts/train_cosql_scprompt.sh 0.1
# Train on Geoquery
CUDA_VISIBLE_DEVICES=0 bash scripts/train_geoquery_scprompt.sh 1.
```

- The numeric argument specifies the proportion of the official training set to use.
The best model is automatically saved under `experimental_outputs/`. Please note that training does not use the fine-grained constrained decoding strategy, which is only needed for evaluation. Refer to the Quick Inference section above to evaluate the fine-tuned model.