# SC-Prompt

## Introduction
This repository contains the code for the paper "Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning". In this paper, we propose SC-Prompt, a novel divide-and-conquer strategy for effectively supporting Text-to-SQL translation in the few-shot scenario.
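At a high level, the divide-and-conquer strategy splits translation into two stages: first predict the *structure* of the SQL query (a skeleton with typed placeholders), then predict the *content* that fills those placeholders. The toy sketch below is purely illustrative (the function name, placeholder format, and example values are not from the paper's code); it only shows how a structure-stage output and a content-stage output combine into a final query:

```python
# Toy illustration of the structure/content split (not the paper's code).
# Stage 1 predicts a SQL skeleton with placeholder slots; stage 2 predicts
# the schema items and values bound to each slot.

def fill_structure(structure: str, content: dict) -> str:
    """Substitute predicted content into a predicted SQL skeleton."""
    sql = structure
    for slot, value in content.items():
        sql = sql.replace(slot, value)
    return sql

# Hypothetical stage-1 output: the query's structure.
structure = "select [col0] from [tab0] where [col1] = [val0]"
# Hypothetical stage-2 output: the content for each slot.
content = {
    "[col0]": "name",
    "[tab0]": "singer",
    "[col1]": "country",
    "[val0]": "'USA'",
}

print(fill_structure(structure, content))
# select name from singer where country = 'USA'
```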
## Setup

```shell
git clone git@github.com:ruc-datalab/SC-prompt.git
cd SC-prompt
mkdir -p -m 777 experimental_outputs
mkdir -p -m 777 transformers_cache
cd experimental_outputs
mkdir -p -m 777 spider
mkdir -p -m 777 cosql
mkdir -p -m 777 geoquery
cd ..
```
## Dataset Download

- Spider: put it under `src/datasets/spider`.
- CoSQL: put it under `src/datasets/cosql`.
- Geoquery: put it under `src/datasets/geoquery`.
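After downloading, a quick sanity check like the following (not part of the repository's scripts) confirms that each dataset landed in the expected folder:

```shell
# Verify the dataset layout described above; run from the repository root.
for d in src/datasets/spider src/datasets/cosql src/datasets/geoquery; do
  if [ -d "$d" ]; then
    echo "found: $d"
  else
    echo "missing: $d" >&2
  fi
done
```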
## Code Structure

```
|-- experimental_outputs  # saves the fine-tuned models and evaluation results
|-- scripts               # the train/inference scripts
|-- src
    |-- datasets          # classes to preprocess the datasets
    |-- metrics           # classes to evaluate the prediction results
    |-- utils             # main code
    |-- run.py            # the entry point to train and run inference with the few-shot text-to-SQL model
```
## Environment

Our constrained decoding method is based on the parser provided by Picard. Please use the Docker image provided by the official Picard repository to build the container:

```shell
docker run -itd --gpus '"device=<your_available_gpu_ids>"' --rm --user 13011:13011 --mount type=bind,source=<your_base_dir>/transformers_cache,target=/transformers_cache --mount type=bind,source=<your_base_dir>/scripts,target=/app/scripts --mount type=bind,source=<your_base_dir>/experimental_outputs,target=/app/experimental_outputs --mount type=bind,source=<your_base_dir>/src,target=/app/src tscholak/text-to-sql-eval:6a252386bed6d4233f0f13f4562d8ae8608e7445
```

Replace `<your_available_gpu_ids>` with the ids of your available GPUs and `<your_base_dir>` with the absolute path of this repository.
## Quick Inference

Download the fine-tuned model and put it under the corresponding folder.

| Dataset  | #Train     | Model | Folder                           |
|----------|------------|-------|----------------------------------|
| Spider   | 0.05 (350) | link  | `experimental_outputs/spider/`   |
| Spider   | 0.1 (700)  | link  | `experimental_outputs/spider/`   |
| CoSQL    | 0.05 (475) | link  | `experimental_outputs/cosql/`    |
| CoSQL    | 0.1 (950)  | link  | `experimental_outputs/cosql/`    |
| Geoquery | 1.0 (536)  | link  | `experimental_outputs/geoquery/` |
Use the scripts to run inference:

```shell
# Inference on Spider
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_spider_scprompt.sh 0.1
# Inference on CoSQL
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_cosql_scprompt.sh 0.1
# Inference on Geoquery
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_geoquery_scprompt.sh 1.
```

- The numeric argument specifies the proportion of the official training set to use.
## Train from Scratch

```shell
# Train on Spider
CUDA_VISIBLE_DEVICES=0 bash scripts/train_spider_scprompt.sh 0.1
# Train on CoSQL
CUDA_VISIBLE_DEVICES=0 bash scripts/train_cosql_scprompt.sh 0.1
# Train on Geoquery
CUDA_VISIBLE_DEVICES=0 bash scripts/train_geoquery_scprompt.sh 1.
```

- The numeric argument specifies the proportion of the official training set to use.
The best model is automatically saved under `experimental_outputs/`. Please note that training does not use the fine-grained constrained decoding strategy, which is only needed for evaluation. Refer to the Quick Inference section above to evaluate the fine-tuned model.