Home

Awesome

Open in HF Spaces

Finetune Datasets

Validation Datasets

Local Large Language Models

Summer 2023 Approaches

1. SQL Guanaco 13B

2. Spider Wizard Coder

3. Spider NatSQL Skeleton WizardCoder

4. Spider Skeleton Wizard Coder

5. ChatGPT

Fall 2023 Approaches

1. Spider Skeleton Wizard Coder + ChatGPT Ranked Schema

2. Spider Skeleton Wizard Coder + 5 Beams + ChatGPT Ranked Schema

3. Spider Skeleton Wizard Coder + 5 Beams + ChatGPT Choose Best SQL + ChatGPT Ranked Schema

4. ChatGPT + Alignment + Clear Context

		# singer ( singer_id, name, country, age )
		# stadium ( capacity, highest, lowest, average )
		# concert ( theme, year, concert_id, concert_name )
		# singer_in_concert ( concert_id, singer_id )
		# concert.stadium_id = stadium.stadium_id
		# singer_in_concert.singer_id = singer.singer_id
		# singer_in_concert.concert_id = concert.concert_id

5. ChatGPT + Alignment + Clear Context + Error Correction

6. (SQLChatGPT) ChatGPT + Alignment + Clear Context + Error Correction + Example Driven Correction

7. SQLChatGPT + SELECT and WHERE Clause Repair

8. Finetuned SQLChatGPT + SELECT and WHERE Clause Repair

ChatGPT Alignment Prompt format:

Role: SYSTEM
Content:
You are now an excellent SQL writer, first I’ll give you some tips and examples, and I need you to remember the tips, and do not make same mistakes.

Role: USER Content:
Tips 1:
Question: Which A has most number of B?
Gold SQL: select A from B group by A order by count ( * ) desc limit 1; Notice that the Gold SQL doesn’t select COUNT(*) because the question only wants to know the A and the number should be only used in ORDER BY clause, there are many questions asks in this way, and I need you to remember this in the the following questions.

Role: ASSISTANT
Content:
Thank you for the tip! I’ll keep in mind that when the question only asks for a certain field, I should not include the COUNT(*) in the SELECT statement, but instead use it in the ORDER BY clause to sort the results based on the count of that field.

Role: USER
Content:
Tips 2:
Don’t use "IN", "OR", "LEFT JOIN" as it might cause extra results, use "INTERSECT" or "EXCEPT" instead, and remember to use "DISTINCT" or "LIMIT" when necessary.
For example,
Question: Who are the A who have been nominated for both B award and C award?
Gold SQL should be: select A from X where award = ’B’ intersect select A from X where award = ’C’;

Role: ASSISTANT
Content:
Thank you for the tip! I’ll remember to use "INTERSECT" or "EXCEPT" instead of "IN", "NOT IN", or "LEFT JOIN" when I want to find records that match or don’t match across two tables. Additionally, I’ll make sure to use "DISTINCT" or "LIMIT" when necessary to avoid repetitive results or limit the number of results returned.

Role: USER (Actual prompt) Content:

### Complete sqlite SQL query only and with no explanation, and do not select extra columns that are not explicitly requested in the query.
### Sqlite SQL tables, with their properties:
#
# singer ( singer_id, name, country, age )
# stadium ( capacity, highest, lowest, average )
# concert ( theme, year, concert_id, concert_name )
# singer_in_concert ( concert_id, singer_id )
# concert.stadium_id = stadium.stadium_id
# singer_in_concert.singer_id = singer.singer_id
# singer_in_concert.concert_id = concert.concert_id
#


### How many singers do we have?
SELECT

Citations

@misc{dong2023c3,
      title={C3: Zero-shot Text-to-SQL with ChatGPT}, 
      author={Xuemei Dong and Chao Zhang and Yuhang Ge and Yuren Mao and Yunjun Gao and lu Chen and Jinshu Lin and Dongfang Lou},
      year={2023},
      eprint={2307.07306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{luo2023wizardcoder,
      title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct}, 
      author={Ziyang Luo and Can Xu and Pu Zhao and Qingfeng Sun and Xiubo Geng and Wenxiang Hu and Chongyang Tao and Jing Ma and Qingwei Lin and Daxin Jiang},
      year={2023},
}
@article{yu2018spider,
  title={Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
  author={Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  journal={arXiv preprint arXiv:1809.08887},
  year={2018}
}
@inproceedings{gan-etal-2021-natural-sql,
    title = "Natural {SQL}: Making {SQL} Easier to Infer from Natural Language Specifications",
    author = "Gan, Yujian  and
      Chen, Xinyun  and
      Xie, Jinxia  and
      Purver, Matthew  and
      Woodward, John R.  and
      Drake, John  and
      Zhang, Qiaofu",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.174",
    doi = "10.18653/v1/2021.findings-emnlp.174",
    pages = "2030--2042",
}
@article{dettmers2023qlora,
  title={QLoRA: Efficient Finetuning of Quantized LLMs},
  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}
@inproceedings{li2022resdsql,
  author = {Haoyang Li and Jing Zhang and Cuiping Li and Hong Chen},
  title = "RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL",
  booktitle = "AAAI",
  year = "2023"
}