CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

最新推荐文章于 2025-04-13 20:00:00 发布

NEU_LightBulb

最新推荐文章于 2025-04-13 20:00:00 发布

阅读量1.5k

点赞数 18

分类专栏： Text2SQL 文章标签：自然语言处理人工智能 sql

本文链接：https://blog.csdn.net/NEU_LightBulb/article/details/143105680

版权

Text2SQL 专栏收录该内容

1 篇文章

订阅专栏

这里写自定义目录标题

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

原文地址.

Abstract

主要简单概括了作者在Text-to-SQL任务上的创新策略以及在BIRD和Spider上的优秀表现。
创新策略：

在一次LLM调用中将复杂的查询分解为可管理的子查询
基于查询执行计划的思链推理，这样可以映射到数据库引擎在执行过程中采取的步骤
一种特殊的Few-shot示例的选择技术
采用微调过的 binary-candidate selection LLM 来对候选集进行排序
作者提出的 CHASE-SQL 在BIRD 数据集基准的测试集和开发集上实现了 73.0% 和 73.01%

Introduction

主要说明Text-to-SQL任务的意义与价值、任务特性、前人的一些探索。
其中特别说明了之前的工作在“候选集选择”、“候选集排序”以及“Self-consistency”使用的局限性，这一切都拉低了模型效果的上限。

Related Work

再次CallBack前人的工作

模型结构发展
Pipeline技巧变化。从Prompt的优化到CoT、Self-correction、Self-consistency的引入，再到复杂的pipeline的构建。

Methods

CHASE-SQL 框架概述

Value Retrieval

和 CHESS 所讨论的一样，“数据库列值长什么样，哪些和问题相关”这些非常重要，是其他方法给不到LLM的。这里作者也是首先使用Few-shot 的Prompt让LLM提取出输入问题中包含的关键信息，对于选择则是采用LSH（locality-sensitive hashing）进行召回，采用Embedding相似度+编辑距离的方式重排。

Multi-path Candidate Generation

如框架图所示，这里采用了DC-CoT（Divide and Conquer Chain-of-Thought）、QP-CoT（Query Plan Chain-of-Thought）和OC（online synthetic）。之所以采用多种方式就是为了“提高候选集的多样性”。

DC-CoT

在这里插入图片描述
Input部分很简单，就是表结构和输入，如果有外部知识的话也可以输入。

Database Info
{DATABASE_SCHEMA}
**************************
Question
Question: What is the gender of the youngest client who opened account in the lowest average salary branch?
Hint: Given that Later birthdate refers to younger age; A11 refers to average salary
**************************

输出分为三个部分。首先是“Divide and Conquer”部分，这部分主要就是将原始问题“Main Question”分解成一堆子问题“Sub-question”。如果子问题还是很复杂，也可以继续拆分子子问题，颗粒度可以以子问题能用正常SQL来表示为止。

这个SQL不一定是要出现在最终的SQL里，可以理解为用Easy或者Medium的难度且没有公式计算之类的就行。

然后，每一个问题（不管是不是子问题）都配上对应的CoT解释“*Analysis”以及解释对应的伪SQL代码“Pseudo SQL”。在解释中尽可能的使用一些话术，如“对应知识XXX我们应该”、“问题提到了XXX是应该在XXX里的”、“XXX可能指的是XXX”、“指标XXX的计算应该是XXX”等来进行引导。在伪SQL代码中还原解释的内容，对于复杂的需要继续拆解的部分用简单的“代词”文本，并且在子问题的内容和解释中Call back这些代词。

**1. Divide and Conquer:**
* **Main Question:** What is the gender of the youngest client who opened account in the lowest average salary branch?
* **Analysis:** Question asking about ‘gender‘, and it appears in table ‘client‘. We will use this as the output column, selecting it from the youngest client in the lowest average salary branch.
* **Pseudo SQL:** SELECT ‘T1‘.‘gender‘ FROM ‘client‘ AS ‘T1‘ WHERE youngest client in the lowest average salary branch
* **Sub-question 1:** youngest client in the lowest average salary branch
* **Analysis:** According to the hint, we need to use the ‘A11‘ from ‘district‘ to get the salary info, and the youngest client can be obtained from using the ‘birth_date‘ column of table ‘client‘. The items between these two tables can be INNER JOIN using district_id.
* **Pseudo SQL:** SELECT ‘T1‘.‘client_id‘ FROM ‘client‘ AS ‘T1‘ INNER JOIN ‘district‘ AS ‘T2‘ ON ‘T1‘.‘district_id‘ = ‘T2‘.‘district_id‘ WHERE lowest average salary branch ORDER BY ‘T1‘.‘birth_date‘ DESC NULLS LAST LIMIT 1
* **Sub-question 1.1:** lowest average salary branch
* **Analysis:** We can get the lowest average salary branch using order by ‘A11‘ ASC and pick top 1. The column ‘A11‘ is not NULLABLE, so we do not need to add "IS NOT NULL" filter 
* **Pseudo SQL:**
SELECT ‘district_id‘ FROM ‘district‘ ORDER BY ‘A11‘ ASC LIMIT 1

第二部分是“Assembling SQL”，差不多就是第一部分的反过来。在第一部分中我们拆解了问题并得到对应部分的伪SQL，现在把他们组装起来

**2. Assembling SQL:**
* **Sub-question 1.1 (lowest average salary branch):** * **SQL:** SELECT ‘district_id‘ FROM ‘district‘ ORDER BY ‘A11‘ ASC LIMIT 1
* **Sub-question 1 (youngest client in the lowest average salary branch):**
* **SQL:** SELECT ‘T1‘.‘client_id‘ FROM ‘client‘ AS ‘T1‘ INNER JOIN ‘district‘ AS ‘T2‘ ON ‘T1‘.‘district_id‘ = ‘T2‘.‘district_id‘ WHERE ‘T2‘.‘district_id‘ IN (SELECT ‘district_id‘ FROM ‘district‘ ORDER BY ‘A11‘ ASC LIMIT 1) ORDER BY ‘T1‘.‘birth_date‘ DESC NULLS LAST LIMIT 1
* **Main Question (gender of the client):**
* **SQL:** SELECT ‘T1‘.‘gender‘ FROM ‘client‘ AS ‘T1‘ WHERE ‘T1‘.‘client_id‘ = (SELECT ‘T1‘.‘client_id‘ FROM ‘client‘ AS ‘T1‘ INNER JOIN ‘district‘ AS ‘T2‘ ON ‘T1‘.‘district_id‘ = ‘T2‘.‘district_id‘ WHERE ‘T2‘.‘district_id‘ IN (SELECT ‘district_id‘ FROM ‘district‘ ORDER BY ‘A11‘ ASC LIMIT 1) ORDER BY ‘T1‘.‘birth_date‘ DESC NULLS LAST LIMIT 1)

最后一部分“Simplification and Optimization”，是一个查缺补漏以及优化结构的阶段。

**3. Simplification and Optimization:**
* The nested queries can be combined using a single ‘INNER JOIN‘ and the filtering can be done within a single ‘ORDER BY‘ clause.
**Final Optimized SQL Query:**
SELECT ‘T1‘.‘gender‘ FROM ‘client‘ AS ‘T1‘ INNER JOIN ‘district‘ AS ‘T2‘ ON ‘T1‘.‘district_id‘ = ‘T2‘.‘district_id‘ ORDER BY ‘T2‘.‘A11‘ ASC, ‘T1‘.‘birth_date‘ DESC NULLS LAST LIMIT 1

QP-CoT

QP-CoT并不是DC的“下游”而是“补充”，作者提到“仿照数据库执行计划的CoT逻辑在联表问题上有优势，可以补充DC在问题需要关系推理时的弱势”。
QP-CoT的输入也很简单，和DC几乎一样——表结构、输入、外部知识

Database Info
{DATABASE_SCHEMA}
**************************
Answer Repeating the question and evidence, and generating the SQL with a query plan.
**Question**: How many Thai restaurants can be found in San Pablo Ave, Albany?
**Evidence**: Thai restaurant refers to food_type = ’thai’; San Pablo Ave Albany refers to street_name = ’san pablo ave’ AND T1.city = ’albany’

QP-CoT的输出就复杂了，按照数据库执行计划的逻辑：

识别和查找与问题相关的表
执行counting, filtering, or matching等操作
通过选择要返回的适当列来提供最终结果

**Query Plan**:
** Preparation Steps:**
1. Initialize the process: Start preparing to execute the query.
2. Prepare storage: Set up storage space (registers) to hold temporary results, initializing them to NULL.
3. Open the location table: Open the location table so we can read from it.
4. Open the generalinfo table: Open the generalinfo table so we can read from it.
** Matching Restaurants:**
1. Start reading the location table: Move to the first row in the location table.
2. Check if the street matches: Look at the street_name column of the current row in location. If it’s not "san pablo ave," skip this row.
3. Identify the matching row: Store the identifier (row ID) of this location entry.
4. Find the corresponding row in generalinfo: Use the row ID from location to directly find the matching row in generalinfo.
5. Check if the food type matches: Look at the food_type column in generalinfo. If it’s not "thai," skip this row.
6. Check if the city matches: Look at the city column in generalinfo. If it’s not "albany," skip this row.
** Counting Restaurants:**
1. Prepare to count this match: If all checks pass, prepare to include this row in the final count.
2. Count this match: Increment the count for each row that meets all the criteria.
3. Move to the next row in location: Go back to the location table and move to the next row, repeating the process until all rows are checked.
4. Finalize the count: Once all rows have been checked, finalize the count of matching rows.
5. Prepare the result: Copy the final count to prepare it for output.
** Delivering the Result:**
1. Output the result: Output the final count, which is the number of restaurants that match all the specified criteria.
2. End the process: Stop the query execution process.
3. Setup phase: Before starting the actual query execution, the system prepares the specific values it will be looking for, like "san pablo ave," "thai," and "albany."
**Final Optimized SQL Query:**
SELECT COUNT(T1.id_restaurant) FROM generalinfo AS T1 INNER JOIN location AS T2 ON T1.id_restaurant = T2.id_restaurant WHERE T1.food_type = ’thai’ AND T1.city = ’albany’ AND T2.street_name = ’san pablo ave’

OC

在这里插入图片描述
首先，Few-shot已经证明在Text2SQL任务上是有效的（应用自Din-sql ），基于此作者创新提出了一种策略，相当于是让大模型基于Few-shot的召回结果从两个角度去生成“更有效更多样”的shot。
第一种是基于原始shot，从SQL常见的结构特征角度生成新shot。旨在让大模型Get到这类问题的各种变体的写法。这里的TARGET_DATABASE_SCHEMA包含所有的结构

You are a SQLite SQL expert. Your job is to create {k} examples, where each example consists of a question and a SQL query to fetch the data for it. I want each example to look like this, question input and SQL output pairs:
"input": "What’s the description of the series code SM.POP.TOTL for Aruba?
(Hints: Aruba is the name of the country where ShortName = ’Aruba’)"
"output": "SELECT T2.Description FROM Country AS T1 INNER JOIN CountryNotes AS T2 ON T1.CountryCode = T2.Countrycode WHERE T1.ShortName =  ’Aruba’ AND T2.Seriescode = ’SM.POP.TOTL’"
You should generate examples that examine and showcase different aspects and relationships of the following table schemas, described in "Table creation statements". Understand the database tables and their relationships. Understand the columns and their types and meanings to construct intresting examples.
Generate a mixture of SQL examples that include:
• some simple SQL query examples without JOIN
• some SQL query examples with aggregates, like COUNT
• some simple SQL query examples with JOIN
• some complex SQL query examples with nested JOIN
**************************
Table creation statements
{TARGET_DATABASE_SCHEMA}
**************************
Generate total of {k} examples. Only outputs the examples (question input and SQL output pairs), and each example can be separated by a new line.

第二种是基于原始shot，从经过选择的表结构（选择方式类似 CHESS 的列选择方式得到t）角度生成新的shot。旨在提高大模型对表结构的理解。这里的TARGET_DATABASE_SCHEMA只包含被选择的结构。

You are a SQLite SQL expert. Your job is to create a set of examples, where each example consists of a question and a SQL query to fetch the data for it.
You should generate examples that examine and showcase different aspects and relationships of the following table schemas. Understand the database tables and their relationships. Understand the columns and their types and meanings to construct intresting examples.
I will also show you multiple examples generated for the other database and its table schemas, so you can see what kind of examples can be generated for a given database.
**************************
###Examples from other database### The following is the table schemas and column examples for
other database:
The database (¨{TRAIN_DATABASE_NAME}¨) structure is defined by the following table schemas (comments after ’–’ provide additional column descriptions).
{TRAIN_DATABASE_SCHEMA}
————————–
The folloiwing are the examples generated for the above database schemas:
Example 1) "input": "Among the countries in the group of Heavily Indebted Poor Countries, how many of them are under the lending category of the International Development Associations? (Hints: group of Heavily Indebted Poor Countries is OtherGroups = ’HIPC’; International Development Associations refers to lendingcategory = ’IDA’)"
"output": "SELECT COUNT(CountryCode) FROM Country WHERE LendingCategory = ’IDA’ AND OtherGroups = ’HIPC’"
...
Example 10) "input": "What is the description of the footnote on the series code AG.LND.FRST.K2 in 1990 for Aruba? (Hints: Year = 1990; Aruba is the name of country where ShortName = ’Aruba’)"
"output": "SELECT T2.Description FROM Country AS T1 INNER JOIN FootNotes AS T2 ON T1.CountryCode = T2.Countrycode WHERE T1.ShortName = ’Aruba’ AND T2.Seriescode = ’AG.LND.FRST.K2’ AND T2.Year = ’YR1990’"
**************************
Now similarly, generate examples (question input and SQL output pairs) for the table schemas defined below, in "Table creation statements".
**************************
###Table creation statements###
TARGET_DATABASE_SCHEMA
**************************
Only outputs the examples (question input and SQL output pairs), and each example can be separated by a new line.

Query Fixer

类似于Self-Correction，让大模型根据执行错误来修复SQL。特别的，引入self-reflection来让大模型分析错误原因。

**Task Description:**
You are an SQL database expert tasked with correcting a SQL query. A previous attempt to run a query did not yield the correct results, either due to errors in execution or because the result returned was empty or unexpected. Your role is to analyze the error based on the provided database schema and the details of the failed execution, and then provide a corrected version of the SQL query.
**Procedure:**
1. Review Database Schema:
- Examine the table creation statements to understand the database structure.
2. Analyze Query Requirements:
- Original Question: Consider what information the query is supposed to retrieve.
- Hint: Use the provided hints to understand the relationships and conditions relevant to the query.
- Executed SQL Query: Review the SQL query that was previously executed and led to an error or incorrect result.
- Execution Result: Analyze the outcome of the executed query to identify why it failed (e.g., syntax errors, incorrect column references, logical mistakes).
3. Correct the Query:
- Modify the SQL query to address the identified issues, ensuring it correctly fetches the requested data according to the database schema and query requirements.
**Output Format:**
Present your corrected query as a single line of SQL code, after Final Answer. Ensure there are no line breaks within the query.
Here are some examples:
{EXAMPLES}
======= Your task =======
**************************
Table creation statements
{DATABASE_SCHEMA}
**************************
The original question is:
Question:
{QUESTION}
Evidence:
{HINT}
The SQL query executed was:
{QUERY}
The execution result:
{RESULT}
**************************
Based on the question, table schema and the previous query, analyze the result try to fix the query.

Selection Agent

上面介绍了多种生成候选SQL的方式，那么接下来难题就来到了“如何选择？”。传统的方式固然是Self-consistency，但是假设“最一致的就是最正确的”过于片面，所以作者提出了一种更精细的策略。

其实我猜测是因为这种方式生成的候选SQL很难有一致的，“多样性”的初衷嘛

在这里插入图片描述
假设有n个候选SQL，首先为每个候选SQL初始化一个零分，然后每次选择两个送给训练过的分类模型进行比较。比较的过程中会考虑“A比B”与“B比A”可能产生的顺序误差。最终就会像打擂台一样选择出来一个最好的SQL。

Instruction:
Given the DB info and question, there are two candidate queries. There is correct one and incorrect one, compare the two candidate answers, analyze the differences of the query and the result. Based on the original question and the provided database info, choose the correct one.
**************************
Database Schema
{DATABASE_SCHEMA}
**************************
Question:
{QUESTION}
Evidence:
{HINT}
**************************
Candidate A
{CANDIDATE_A_QUERY}
Execution result
{CANDIDATE_A_RESULT}
**************************
Candidate B
{CANDIDATE_B_QUERY}
Execution result
{CANDIDATE_B_RESULT}
Just output the correct answer "A" or "B".

Experiments

稍后更新