LLMs for Text2SQL: "A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?"

一个处女座的程序猿

Last modified 2025-01-07 23:10:42

Category: NLP/LLMs. Tags: LLMs, SQL, NL2SQL

First published 2024-08-17 02:00:18

Article link: https://blog.csdn.net/qq_41185868/article/details/141273084

LLMs for Text2SQL: Translation and Commentary on "A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?"

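Overview: Published on August 9, 2024, the paper reviews how NL2SQL techniques have evolved in the LLM era. It analyzes the task from a lifecycle perspective, covering models, data, evaluation, and error analysis. It summarizes the key modules of language-model-powered NL2SQL methods, analyzes NL2SQL benchmarks and evaluation metrics, proposes a roadmap for adapting LLMs to NL2SQL tasks, and discusses open problems and research challenges in the field.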

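>> Definition of the NL2SQL task: translate a natural language query (NL) into an executable SQL query, lowering the barrier for users to access relational databases and supporting a range of commercial applications.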
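To make the task definition concrete, here is a minimal sketch of one NL question paired with one SQL query that answers it, run against an in-memory SQLite database. The `orders` table, its rows, and the question are all invented for illustration and do not come from the paper.

```python
import sqlite3

# Hypothetical toy database; table name and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 120.0), ('bob', 80.0), ('alice', 50.0);
""")

# The NL input a user might type, and one SQL query that answers it.
nl_question = "What is the total order amount for each customer?"
sql_query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""

# An NL2SQL system has to produce sql_query (or an equivalent) from nl_question.
rows = conn.execute(sql_query).fetchall()
print(nl_question)
print(rows)
```

Other queries would return the same answer (for example, with a different alias or no `ORDER BY`), which is the one-to-many mapping the challenges section refers to.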

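>> Background and pain points: challenges of the NL2SQL task:

Why NL2SQL matters: NL2SQL lowers the barrier to accessing relational databases and supports commercial applications such as business intelligence and customer support.

Challenges and problems

● Ambiguity of natural language: lexical ambiguity, syntactic ambiguity, under-specification, and user mistakes.

● Database complexity: complex relationships among many tables, similar column names, domain-specific schema variations, and large amounts of dirty data.

● NL2SQL translation: converting free-form natural language into constrained SQL syntax is a one-to-many mapping (one question can correspond to several valid SQL queries) and depends on the database schema.

● Technical challenges: model efficiency, SQL efficiency, cost-effectiveness, insufficient and noisy data, data privacy, and trustworthiness and reliability.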

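>> Solutions:

(1) Language-model-powered NL2SQL solutions:

T1. Rule-based methods: the earliest approaches, built on predefined rules or semantic parsers. They lack adaptability, scalability, and generalization.

T2. Neural-network-based methods: use sequence-to-sequence models or graph neural networks, but are limited by model size and the amount of training data.

T3. PLM-based methods: use pre-trained language models such as BERT and T5 and achieve competitive performance on various benchmarks, but require fine-tuning for the NL2SQL task.

T4. LLM-based methods: exploit the emergent capabilities of LLMs, addressing NL2SQL through prompt design and LLM training. They have made significant progress but still face database-level challenges.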

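(2) Lifecycle framework:

● Model: resolve natural language ambiguity and map the query correctly onto the database schema.

● Data: collect and synthesize high-quality training data.

● Evaluation: evaluate NL2SQL methods from multiple angles.

● Error analysis: find the root causes of errors in order to improve the models.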
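As one concrete instance of the evaluation stage, the sketch below computes an execution-based accuracy: a predicted query counts as correct when it returns the same rows as the gold query on the same database. This is a simplified illustration, not the paper's evaluation toolkit. The `singer` table, the query pairs, and the choice to ignore row order are assumptions made for the example.

```python
import sqlite3

def same_result(conn: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """True if the predicted query runs and returns the same rows as the gold query.

    Assumption: rows are compared as multisets, so ORDER BY differences are ignored.
    """
    try:
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute is counted as wrong
    gold = conn.execute(gold_sql).fetchall()
    return sorted(pred, key=repr) == sorted(gold, key=repr)

def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose execution results match."""
    if not pairs:
        return 0.0
    return sum(same_result(conn, p, g) for p, g in pairs) / len(pairs)

# Hypothetical database and predictions, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (name TEXT, age INTEGER);
    INSERT INTO singer VALUES ('Ann', 41), ('Bo', 25), ('Cy', 33);
""")
pairs = [
    ("SELECT name FROM singer WHERE age > 30", "SELECT name FROM singer WHERE age > 30"),
    ("SELECT name FROM singer ORDER BY name DESC", "SELECT name FROM singer"),
    ("SELECT nme FROM singer", "SELECT name FROM singer"),            # unknown column
    ("SELECT COUNT(*) FROM singer", "SELECT MAX(age) FROM singer"),   # wrong result
]
print(execution_accuracy(conn, pairs))
```

Comparing execution results rather than SQL strings tolerates the one-to-many mapping between a question and its valid queries, though two different queries can also agree by coincidence on a small database.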

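>> Core pipeline steps:

(1) Pre-processing:

● Schema linking: identify the tables and columns relevant to the NL query.

● Database content retrieval: retrieve the database content or cell values needed to generate the SQL query.

● Additional information acquisition: incorporate domain-specific knowledge to enrich the context.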
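The schema-linking step above can be sketched with a deliberately naive heuristic: keep the tables and columns whose name tokens also appear in the question. Real systems use learned or LLM-based linkers. The token-overlap rule, the function names, and the toy schema here are assumptions for illustration only.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split on anything that is not a letter or digit (underscores split too)."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}

def link_schema(question: str, schema: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the tables (and their matching columns) that share a token with the question."""
    q_tokens = tokenize(question)
    linked: dict[str, list[str]] = {}
    for table, columns in schema.items():
        hit_cols = [c for c in columns if tokenize(c) & q_tokens]
        if hit_cols or tokenize(table) & q_tokens:
            linked[table] = hit_cols
    return linked

# Hypothetical schema, invented for illustration.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "venue", "year"],
}
print(link_schema("List the name and age of every singer", schema))
```

Note that `singer_id` is kept only because it shares the token `singer` with the question. Exact token matching also misses synonyms and paraphrases, which is part of why schema linking is treated as its own module.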

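(2) NL2SQL translation:

● Encoding strategy: convert the NL query and the database schema into an internal representation that captures semantic and structural information.

● Decoding strategy: convert the internal representation into an SQL query.

● Task-specific prompt strategy: give the NL2SQL model tailored guidance to optimize the translation process.

● Intermediate representation: build a bridge between NL and SQL, offering a structured way to abstract, align, and optimize NL understanding, simplify complex reasoning, and guide the generation of accurate SQL queries.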
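For LLM-based methods, the encoding and prompt strategies often come down to how the schema and the question are serialized into text. The sketch below shows one possible layout: the schema as `CREATE TABLE` statements, optional few-shot examples, then the question. The layout, the instruction wording, and the function names are assumptions for illustration, and no model is called.

```python
def serialize_schema(schema: dict[str, list[tuple[str, str]]]) -> str:
    """Render each table as a CREATE TABLE statement (one common way to encode a schema as text)."""
    statements = []
    for table, columns in schema.items():
        col_defs = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
        statements.append(f"CREATE TABLE {table} (\n{col_defs}\n);")
    return "\n".join(statements)

def build_prompt(
    question: str,
    schema: dict[str, list[tuple[str, str]]],
    examples: tuple[tuple[str, str], ...] = (),
) -> str:
    """Assemble instruction, schema, optional few-shot (question, SQL) pairs, and the question."""
    parts = ["-- Translate the question into a single SQLite query.", serialize_schema(schema)]
    for ex_question, ex_sql in examples:
        parts.append(f"-- Question: {ex_question}\n{ex_sql}")
    # Ending on SELECT nudges the model to continue with SQL rather than prose.
    parts.append(f"-- Question: {question}\nSELECT")
    return "\n\n".join(parts)

# Hypothetical schema and question, invented for illustration.
schema = {"singer": [("singer_id", "INTEGER"), ("name", "TEXT"), ("age", "INTEGER")]}
prompt = build_prompt(
    "How many singers are older than 30?",
    schema,
    examples=(("List all singer names.", "SELECT name FROM singer;"),),
)
print(prompt)
```

The string returned by `build_prompt` would be sent to whichever LLM is in use. Passing only the tables and columns selected by schema linking keeps the prompt short on large databases.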

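(3) Post-processing:

● SQL correction strategy: identify and fix syntax errors in the generated SQL query.

● Output consistency: sample multiple reasoning results and select the most consistent one, so that the final SQL query is stable across runs.

● Execution-guided strategy: use the execution results of SQL queries to guide subsequent refinement.

● N-best ranking strategy: rerank the model's top-n results to improve query accuracy.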
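Two of the post-processing ideas above can be combined in a small sketch: execute every sampled candidate, discard those that fail (a simple execution-guided filter), and keep a candidate whose result is shared by the most samples (output consistency by voting). This is an assumed, simplified variant, not a specific method from the survey. The table `t` and the candidate list are invented.

```python
import sqlite3
from collections import Counter
from typing import Optional

def execute(conn: sqlite3.Connection, sql: str) -> Optional[tuple]:
    """Return a hashable, order-insensitive result, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def pick_consistent(conn: sqlite3.Connection, candidates: list[str]) -> Optional[str]:
    """Drop candidates that fail to execute, then return the first candidate
    whose execution result is the most common one among the survivors."""
    valid = [(sql, res) for sql in candidates if (res := execute(conn, sql)) is not None]
    if not valid:
        return None
    winner, _ = Counter(res for _, res in valid).most_common(1)[0]
    return next(sql for sql, res in valid if res == winner)

# Hypothetical database and sampled candidates, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (5);")
candidates = [
    "SELECT MAX(x) FROM t",     # runs, but is the minority answer
    "SELECT COUNT(*) FROM t",   # agrees with the next candidate
    "SELECT COUNT(x) FROM t",
    "SELEC x FROM t",           # syntax error, filtered out
    "SELECT y FROM t",          # unknown column, filtered out
]
best = pick_consistent(conn, candidates)
print(best)
```

Running model-generated SQL against a live database can be destructive, so a setup like this would normally use a read-only connection or a scratch copy of the data.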

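>> Advantages:

● Performance gains: LLMs have markedly improved natural language understanding and the accuracy of generated SQL.

● Strong adaptability: through their emergent capabilities, LLMs can perform NL2SQL from prompts alone, handle more complex database schemas and SQL queries, and adapt to different database domains and variations in natural language phrasing.

● Easy to optimize: prompt engineering and fine-tuning allow optimization for a specific task.

The paper systematically summarizes the challenges, solutions, and future research directions of NL2SQL in the LLM era, and offers practical guidance for developing and optimizing NL2SQL models.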

Translation and Commentary on "A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?"

Abstract

1 Introduction

Fig. 1: An Overview of the Survey: The Lifecycle of the NL2SQL Task.

XI CONCLUSION

Translation and Commentary on "A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?"

Link	Paper: https://arxiv.org/pdf/2408.05109
Date	August 9, 2024
Authors

Abstract

Translating users' natural language queries (NL) into SQL queries (i.e., NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs, covering its entire lifecycle from the following four aspects: (1) Model: NL2SQL translation techniques that tackle not only NL ambiguity and under-specification, but also properly map NL with database schema and instances; (2) Data: From the collection of training data, data synthesis due to training data scarcity, to NL2SQL benchmarks; (3) Evaluation: Evaluating NL2SQL methods from multiple angles using different metrics and granularities; and (4) Error Analysis: analyzing NL2SQL errors to find the root cause and guiding NL2SQL models to evolve. Moreover, we provide a rule of thumb for developing NL2SQL solutions. Finally, we discuss the research challenges and open problems of NL2SQL in the LLMs era.

1 Introduction

Natural Language to SQL (i.e., NL2SQL), which converts a natural language query (NL) into an SQL query, is a key technique toward lowering the barrier to accessing relational databases [1, 2, 3, 4, 5, 6, 7]. This technique supports various important applications such as business intelligence, customer support, and more, making it a key step toward democratizing data science [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. Recent advancements in language models have significantly extended the frontiers of research and application in NL2SQL. Concurrently, the trend among database vendors to offer NL2SQL solutions has evolved from a mere notion to a necessary strategy [20, 21]. Therefore, it’s important for us to understand the fundamentals, techniques, and challenges regarding NL2SQL.

In this survey, we will systematically review recent NL2SQL techniques through a new framework, as shown in the survey's overview figure.

• NL2SQL with Language Models. We will first review existing NL2SQL solutions from the perspective of language models, categorizing them into four major categories (see panel (a) of the overview figure). We will then focus on the recent advances in Pre-trained Language Models (PLMs) and Large Language Models (LLMs) for NL2SQL.

• Benchmarks and Training Data Synthesis. Undoubtedly, the performance of PLM- and LLM-based NL2SQL models is highly dependent on the amount and quality of the training data. Therefore, we will first summarize the characteristics of existing benchmarks and analyze their statistical information (e.g., database and query complexity) in detail. We will then discuss methods for collecting and synthesizing high-quality training data, highlighting this as a research opportunity (see panel (b) of the overview figure).

• Evaluation. Comprehensively evaluating NL2SQL models is crucial for optimizing and selecting models for different usage scenarios. We will discuss the multi-angle evaluation and scenario-based evaluation for the NL2SQL task (see panel (c) of the overview figure). For example, performance can be assessed in specific contexts by filtering datasets based on SQL characteristics, NL variants, and database domains.

• NL2SQL Error Analysis. Error analysis is essential in NL2SQL research for identifying model limitations. We review existing error taxonomies, analyze their limitations, and propose principles for designing taxonomies for NL2SQL output errors. Using these principles, we create a two-level error taxonomy and utilize it to summarize and analyze NL2SQL output errors (see panel (d) of the overview figure).
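The scenario-based evaluation mentioned in the Evaluation item (filtering a dataset by SQL characteristics) can be sketched as follows. Each gold query is tagged with coarse features detected by regular expressions, and accuracy is reported per feature. The feature set, the regex heuristics, and the sample records are assumptions for illustration, not the paper's tooling.

```python
import re
from collections import defaultdict

# Coarse, regex-based SQL characteristics (an assumed, illustrative feature set).
FEATURES = {
    "join": re.compile(r"\bJOIN\b", re.IGNORECASE),
    "group_by": re.compile(r"\bGROUP\s+BY\b", re.IGNORECASE),
    "subquery": re.compile(r"\(\s*SELECT\b", re.IGNORECASE),
}

def sql_features(sql: str) -> set[str]:
    """Return the features found in the query, or {'simple'} if none match."""
    found = {name for name, pattern in FEATURES.items() if pattern.search(sql)}
    return found or {"simple"}

def accuracy_by_feature(records: list[tuple[str, bool]]) -> dict[str, float]:
    """records holds (gold_sql, prediction_was_correct); returns accuracy per feature."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for gold_sql, correct in records:
        for feature in sql_features(gold_sql):
            totals[feature] += 1
            hits[feature] += int(correct)
    return {feature: hits[feature] / totals[feature] for feature in totals}

# Hypothetical evaluation records, invented for illustration.
records = [
    ("SELECT name FROM singer", True),
    ("SELECT a.x FROM a JOIN b ON a.id = b.id", False),
    ("SELECT a.x FROM a JOIN b ON a.id = b.id GROUP BY a.x", True),
    ("SELECT x FROM a WHERE id IN (SELECT id FROM b)", False),
]
print(accuracy_by_feature(records))
```

A breakdown like this shows where a model is weak (here, subqueries) in a way that a single aggregate score does not.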

Next, we will introduce practical guidance for developing NL2SQL solutions, including a roadmap we designed for optimizing LLMs for the NL2SQL task, along with a decision flow we created for selecting NL2SQL modules tailored to different NL2SQL scenarios. Finally, we will introduce some interesting and important open problems in the field of NL2SQL, including open-world NL2SQL tasks, cost-effective NL2SQL with LLMs, and trustworthy NL2SQL solutions.

Differences from Existing Surveys. Our survey distinguishes itself from existing NL2SQL surveys [22, 23, 24, 25, 26] and tutorials [27, 28, 29] in five aspects.

• We systematically review the entire lifecycle of the NL2SQL problem, as shown in the overview figure. This lifecycle includes training data collection and synthesis methods (panel (b)), various NL2SQL translation methodologies (panel (a)), multi-angle and scenario-based evaluations (panel (c)), and NL2SQL output error analysis techniques (panel (d)).

• We provide a more detailed and comprehensive summary of the inherent challenges in NL2SQL. Additionally, we analyze the technical challenges when developing a robust NL2SQL solution for real-world scenarios, which are often overlooked in other surveys.

• We particularly focus on recent advances in LLM-based NL2SQL methods, summarizing key modules and comparing different strategies within this scope. We are the first survey to provide a modular summary of methods and provide detailed analyses for each key module (e.g., database content retrieval).

• We highlight the importance of evaluating NL2SQL methods in a multi-angle way, analyze the key NL2SQL error patterns, and provide a two-level error taxonomy.

• We provide practitioners with a roadmap for optimizing LLMs for the NL2SQL task and a decision flow for selecting the suitable NL2SQL modules for various usage scenarios.

Contributions. We make the following contributions.

• NL2SQL with Language Models. We comprehensively review existing NL2SQL techniques from a lifecycle perspective (see the overview figure). We introduce the NL2SQL task definition, discuss challenges (Figure 1), provide a taxonomy of NL2SQL solutions based on language models (Figure 2), and summarize the key modules of language model-powered NL2SQL solutions (Figure 4 and Table I). Next, we elaborate on each module of language model-powered NL2SQL methods, including the pre-processing strategies (Section IV), NL2SQL translation methods (Section V), and post-processing techniques (Section VI).

• NL2SQL Benchmarks. We review NL2SQL benchmarks based on their characteristics (Figure 9). We analyze each benchmark in-depth and present their statistical information (Table II). (Section VII)

• NL2SQL Evaluation and Errors Analysis. We highlight the importance of evaluation in developing practical NL2SQL solutions. We review widely used evaluation metrics and toolkits for assessing NL2SQL solutions. We provide a taxonomy to summarize typical errors produced by NL2SQL methods. (Section VIII)

• Practical Guidance for Developing NL2SQL Solutions. We provide a roadmap for optimizing existing LLMs for NL2SQL tasks (Figure 12(a)). In addition, we design a decision flow to guide the selection of appropriate NL2SQL modules for different scenarios (Figure 12(b)).

• Open Problems in NL2SQL. Finally, we discuss new research opportunities, including the open-world NL2SQL problem and cost-effective NL2SQL solutions (Section X).

• NL2SQL Handbook. We maintain a continuously updated handbook for readers to easily track the latest NL2SQL techniques in the literature and provide practical guidance for researchers and practitioners.

Fig. 1: An Overview of the Survey: The Lifecycle of the NL2SQL Task.

XI CONCLUSION

In this paper, we provide a comprehensive review of NL2SQL techniques from a lifecycle perspective in the era of LLMs. We formally state the NL2SQL task, discuss its challenges, and present a detailed taxonomy of solutions based on the language models they rely on. We summarize the key modules of language model-powered NL2SQL methods, covering pre-processing strategies, translation models, and post-processing techniques. We also analyze NL2SQL benchmarks and evaluation metrics, highlighting their characteristics and typical errors. Furthermore, we outline a roadmap for practitioners to adapt LLMs for NL2SQL solutions. Finally, we maintain an updated online handbook to guide researchers and practitioners in the latest NL2SQL advancements and discuss the research challenges and open problems for NL2SQL.
