Home

Awesome

<h1 align="center">NL2SQL Handbook</h1>

From this repository, you can view the latest advancements in NL2SQL. This handbook corresponds to our survey paper: A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?. We also provide tutorial slides to summarize the key points of this survey. Based on the trends in the development of language models, we have created a river diagram of NL2SQL methods to trace the evolution of the NL2SQL field.

If you are a novice, don't worry—we have prepared a practical guide for you, covering a wide range of foundational materials here. We summarized NL2SQL related applications.

<p align="center"> <img width="800" src="./assets/river.svg"/> </p>
@misc{liu2024surveynl2sqllargelanguage,
      title={A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?}, 
      author={Xinyu Liu and Shuyu Shen and Boyan Li and Peixian Ma and Runzhi Jiang and Yuyu Luo and Yuxin Zhang and Ju Fan and Guoliang Li and Nan Tang},
      year={2024},
      eprint={2408.05109},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2408.05109}, 
}

🧭 NL2SQL Introduction

Translating users' natural language queries (NL) into SQL queries can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly improved with the emergence of language models (LMs). In this context, it is crucial to assess our current position, determine the NL2SQL solutions that should be adopted for specific scenarios by practitioners, and identify the research topics that researchers should explore next.

<p align="center"> <img width="600" src="./assets/NL2SQL.jpg"/> </p>

📈 NL2SQL Lifecycle

<p align="center"> <img width="800" src="./assets/nl2sql_lifecycle.svg"/> </p>

🤔 Where Are We?

we categorize the challenges of NL2SQL into five levels, each addressing specific hurdles. The first three levels cover challenges that have been or are currently being addressed, reflecting the progressive development of NL2SQL. The fourth level represents the challenges we aim to tackle in the LLMs stage, while the fifth level outlines our vision for NL2SQL system in the next five years.

We describe the evolution of NL2SQL solutions from the perspective of language models, categorizing it into four stages. For each stage of NL2SQL, we analyze the changes in target users and the extent to which challenges are addressed.

<p align="center"> <img width="800" src="./assets/The Evolution of NL2SQL Solutions from the Perspective of Language Models.svg"/> </p>

🧩 Module-based NL2SQL Methods

We summarize the key modules of NL2SQL solutions utilizing the language model.

<p align="center"> <img width="600" src="./assets/An Overview of NL2SQL Method in the LLM Era.svg"/> </p>

📚 NL2SQL Survey & Tutorial

  1. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  2. Next-generation databas interfaces: A survey of llm-based text-to-sql.<img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey">
  3. Large Language Model Enhanced Text-to-SQL Generation: A Survey. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey">
  4. From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey">
  5. Natural language interfaces for tabular data querying and visualization: A survey. <img src="https://img.shields.io/badge/TKDE'2024-green"> <img src="https://img.shields.io/badge/Paper-grey">
  6. Natural Language Interfaces for Databases with Deep Learning.<img src="https://img.shields.io/badge/VLDB'2023-blue"> <img src="https://img.shields.io/badge/Paper-grey">
  7. A survey on deep learning approaches for text-to-SQL. <img src="https://img.shields.io/badge/VLDBJ'2023-blue"> <img src="https://img.shields.io/badge/Paper-grey">
  8. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. <img src="https://img.shields.io/badge/COLING'2022-9cf"> <img src="https://img.shields.io/badge/Paper-grey">
  9. A Deep Dive into Deep Learning Approaches for Text-to-SQL Systems. <img src="https://img.shields.io/badge/SIGMOD'2021-red"> <img src="https://img.shields.io/badge/Paper-grey">
  10. State of the Art and Open Challenges in Natural Language Interfaces to Data. <img src="https://img.shields.io/badge/SIGMOD'2020-red"> <img src="https://img.shields.io/badge/Paper-grey">
  11. Natural language to SQL: Where are we today? <img src="https://img.shields.io/badge/VLDB'2020-blue"> <img src="https://img.shields.io/badge/Paper-grey">

📰 NL2SQL Paper List

  1. The Dawn of Natural Language to SQL: Are We Fully Ready? <img src="https://img.shields.io/badge/VLDB'2024-blue"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  2. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. <img src="https://img.shields.io/badge/VLDB'2024-blue"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  3. Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation. <img src="https://img.shields.io/badge/VLDB'2024-blue"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  4. Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models. <img src="https://img.shields.io/badge/VLDB'2024-blue"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  5. ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems.<img src="https://img.shields.io/badge/VLDB'2024-blue"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  6. CodeS: Towards Building Open-source Language Models for Text-to-SQL. <img src="https://img.shields.io/badge/SIGMOD'2024-red"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  7. FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis. <img src="https://img.shields.io/badge/SIGMOD'2024-red"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  8. PURPLE: Making a Large Language Model a Better SQL Writer. <img src="https://img.shields.io/badge/ICDE'2024-green"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  9. METASQL: A Generate-then-Rank Framework for Natural Language to SQL Translation. <img src="https://img.shields.io/badge/ICDE'2024-green"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  10. Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning. <img src="https://img.shields.io/badge/ACL'2024-9cf"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  11. Synthesizing Text-to-SQL Data from Weak and Strong LLMs. <img src="https://img.shields.io/badge/ACL'2024-9cf"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  12. Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark. <img src="https://img.shields.io/badge/ACL'2024-9cf"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  13. I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation. <img src="https://img.shields.io/badge/EMNLP'2024-orange"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  14. PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL. <img src="https://img.shields.io/badge/EMNLP'2024-orange"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  15. Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning. <img src="https://img.shields.io/badge/EMNLP'2024-orange"> <img src="https://img.shields.io/badge/Paper-grey">
  16. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey">
  17. Towards Optimizing SQL Generation via LLM Routing. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey">
  18. XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey">
  19. E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  20. DB-GPT: Empowering Database Interactions with Private Large Language Models. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  21. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey">
  22. DBCopilot: Scaling Natural Language Querying to Massive Databases. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  23. CHESS: Contextual Harnessing for Efficient SQL Synthesis. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  24. PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  25. CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  26. AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries. <img src="https://img.shields.io/badge/arXiv'2024-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  27. Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning. <img src="https://img.shields.io/badge/VLDB'2023-blue"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  28. CatSQL: Towards Real World Natural Language to SQL Applications. <img src="https://img.shields.io/badge/VLDB'2023-blue"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  29. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. <img src="https://img.shields.io/badge/NeurIPS'2023-yellow"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  30. ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought. <img src="https://img.shields.io/badge/EMNLP'2023-orange"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  31. Selective Demonstrations for Cross-domain Text-to-SQL. <img src="https://img.shields.io/badge/EMNLP'2023-orange"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  32. RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. <img src="https://img.shields.io/badge/AAAI'2023-cyan"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  33. Graphix-T5: Mixing Pre-trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing. <img src="https://img.shields.io/badge/AAAI'2023-cyan"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  34. Improving Generalization in Language Model-based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-based Techniques. <img src="https://img.shields.io/badge/ACL'2023-9cf"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  35. G<sup>3</sup>R: A Graph-Guided Generate-and-Rerank Framework for Complex and Cross-domain Text-to-SQL Generation. <img src="https://img.shields.io/badge/ACL(findings)'2023-9cf"> <img src="https://img.shields.io/badge/Paper-grey">
  36. Importance of Synthesizing High-quality Data for Text-to-SQL Parsing. <img src="https://img.shields.io/badge/ACL(findings)'2023-9cf"> <img src="https://img.shields.io/badge/Paper-grey">
  37. Know What I don’t Know: Handling Ambiguous and Unknown Questions for Text-to-SQL. <img src="https://img.shields.io/badge/ACL(findings)'2023-9cf"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  38. C3: Zero-shot Text-to-SQL with ChatGPT <img src="https://img.shields.io/badge/arXiv'2023-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  39. MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL. <img src="https://img.shields.io/badge/arXiv'2023-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">
  40. SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation. <img src="https://img.shields.io/badge/arXiv'2023-purple"> <img src="https://img.shields.io/badge/Paper-grey"> <img src="https://img.shields.io/badge/Code-grey">

📊 NL2SQL Benchmark

We create a timeline of the benchmark's development and mark relevant milestones. You can get more details from this chapter: 📊 Benchmark

<p align="center"> <img width="800" src="./assets/Dataset_timeline.svg"/> </p>

🎯 Where Are We Going?

📖 Catalog for Our Survey

You can get more information from our subsection. We introduce representative papers on related concepts:

💾 Practical Guide for Novice

📊 How to get data:

🛠️ How to build an LLM-based NL2SQL model:

🔎How to evaluate your model:

We collect NL2SQL evaluation metrics for you. You can get more details from this chapter: Evaluation

🗺️ Roadmap and Decision Flow

You can get some inspiration from the Roadmap and Decision Flow.

<p align="center"> <img width="800" src="./assets/NL2SQL_Guidance.svg"/> </p>

📱 NL2SQL Related Applications: