[RAG] How to Do RAG over Tables? TableRAG: A Framework for Enhancing Large-Scale Table Understanding

Latest recommended article published on 2025-04-01 19:37:09

余俊晖

Article link: https://blog.csdn.net/yjh_SE007/article/details/142845298

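Earlier posts in this series covered RAG methods for text-dense documents. This post looks at how to do RAG when the data consists largely of tables.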

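Existing LLM-based approaches usually feed the entire table to the model, which causes problems such as positional bias and context-length limits, especially for large tables. To address this, the paper proposes TableRAG, a framework that combines query expansion with schema retrieval and cell retrieval to pinpoint the key information before handing it to the LLM. This allows more efficient data encoding and more precise retrieval, which substantially shortens the prompt and reduces information loss.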

Comparison of table prompting techniques for LLMs

(a) Read Table

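The language model reads the entire table. This is the most direct approach, but the shaded region (the data given to the model) covers every row and column, so a large table exceeds the model's token limit and the approach becomes infeasible.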

(b) Read Schema

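The language model reads only the table schema, that is, the column names and data types. Nothing about the actual cell contents is included, so information about the table's contents is lost.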

(c) Row-Column Retrieval

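Rows and columns are encoded and then selected by their similarity to the question. Only the intersection of the selected rows and columns is shown to the language model.
For large tables, encoding every row and column is still infeasible.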

(d) Schema-Cell Retrieval (Ours)

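Column names and cells are encoded and retrieved by their relevance to queries that the language model generates from the question. Only the retrieved schema entries and cells (column names and cell values) are given to the language model.
This makes both encoding and reasoning more efficient.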

(e) Retrieval Performance on ArcadeQA

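This panel shows the retrieval results of the different methods on the ArcadeQA dataset. TableRAG outperforms the other methods on both column retrieval and cell retrieval, which in turn improves the downstream table reasoning.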

Method

TableRAG Example

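The core idea is to combine schema retrieval and cell retrieval to gather the information needed to solve the problem with a program-aided LLM. In practice there is no need to give the LLM the whole table; the key information usually sits in the specific column names, data types, and cell values directly related to the question. Consider the question "What is the average price for wallets?". To answer it, a program may only need to extract the rows related to "wallets" and then average the price column. Knowing the relevant column names and how "wallets" is represented in the table is enough to write that program. This is how TableRAG gets around the context-length limit that full-table prompting runs into.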
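To make the wallet example concrete, here is a minimal sketch in plain Python. The toy table, its column names (`product`, `price`), and the helper `average_price` are all hypothetical and not taken from the paper; the point is only that the program needs two column names and the spelling of "wallet", not the full table in the prompt.

```python
from statistics import mean

# Hypothetical toy table; column names and values are illustrative only.
rows = [
    {"product": "leather wallet", "price": 20.0},
    {"product": "phone case", "price": 8.0},
    {"product": "canvas wallet", "price": 10.0},
    {"product": "belt", "price": 15.0},
]


def average_price(rows, keyword, name_col="product", price_col="price"):
    """Average `price_col` over rows whose `name_col` contains `keyword`."""
    prices = [
        row[price_col]
        for row in rows
        if keyword.lower() in str(row[name_col]).lower()
    ]
    # Return None instead of raising when nothing matches the keyword.
    return mean(prices) if prices else None


print(average_price(rows, "wallet"))  # 15.0
```

Everything the function needs (which column holds names, which holds prices, and the keyword to match) is exactly what schema retrieval and cell retrieval are meant to supply.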

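TableRAG workflow: the table is used to build a schema database and a cell database. The LLM then expands the question into multiple schema queries and cell queries. These queries are used in turn to retrieve schemas and column-cell pairs. The top-K candidates for each query are combined and placed in the prompt of the LLM solver, which answers the question.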
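The workflow above can be sketched roughly as follows. This is not the authors' implementation: `similarity` uses `difflib.SequenceMatcher` as a stand-in for an embedding encoder, storing each distinct (column, value) pair once is an assumption, and all function names and the toy table are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for embedding similarity; a real system would use an encoder.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_databases(table):
    """table: dict mapping column name -> list of cell values."""
    schema_db = list(table.keys())
    # Assumption: each distinct (column, value) pair is stored once.
    cell_db = sorted(
        {(col, str(value)) for col, values in table.items() for value in values}
    )
    return schema_db, cell_db


def top_k(query, candidates, key, k):
    """Return the k candidates most similar to the query."""
    ranked = sorted(candidates, key=lambda c: similarity(query, key(c)), reverse=True)
    return ranked[:k]


def retrieve(table, schema_queries, cell_queries, k=2):
    """Run every query, keep its top-k hits, and merge them without duplicates."""
    schema_db, cell_db = build_databases(table)
    columns, cells = [], []
    for query in schema_queries:
        for col in top_k(query, schema_db, key=lambda c: c, k=k):
            if col not in columns:
                columns.append(col)
    for query in cell_queries:
        for pair in top_k(query, cell_db, key=lambda p: p[1], k=k):
            if pair not in cells:
                cells.append(pair)
    return columns, cells


table = {
    "product": ["wallet", "belt", "wallet", "scarf"],
    "price": [20, 15, 10, 30],
    "order_no": ["A1", "A2", "A3", "A4"],
}
columns, cells = retrieve(table, ["product", "price"], ["wallet"], k=1)
print(columns)  # ['product', 'price']
print(cells)    # [('product', 'wallet')]
```

The merged `columns` and `cells` lists are what would be formatted into the solver prompt in place of the full table.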

Core Components of TableRAG

Tabular Query Expansion

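To operate on a table effectively, the key is to identify precisely which column names and cell values the question requires. Unlike earlier methods, TableRAG does not use the question itself as a single query; it generates separate queries for the schema and for cell values. For example, given the question "What is the average price for wallets?", the model is prompted to produce candidate queries for column names (such as "product" and "price") and for relevant cell values (such as "wallet"). These queries are then used to retrieve the relevant schema entries and cell values from the table.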
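A minimal sketch of this expansion step is shown below. The prompt wording, the JSON reply format, and the names `build_expansion_prompt`, `expand_queries`, and `fake_llm` are assumptions made for illustration; the paper's actual expansion prompts are not reproduced here. The model is passed in as any callable from string to string, so no specific LLM API is implied.

```python
import json


def build_expansion_prompt(question, target):
    """Build an illustrative prompt; target is 'column names' or 'cell values'."""
    return (
        f"Question: {question}\n"
        f"List keywords for {target} that may be needed to answer the question. "
        "Reply with a JSON array of strings."
    )


def expand_queries(question, llm):
    """Ask `llm` (any callable str -> str) for separate schema and cell queries."""
    queries = {}
    for name, target in (("schema", "column names"), ("cell", "cell values")):
        reply = llm(build_expansion_prompt(question, target))
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            parsed = []  # Treat an unparseable reply as "no queries".
        if not isinstance(parsed, list):
            parsed = []
        queries[name] = [q for q in parsed if isinstance(q, str)]
    return queries


def fake_llm(prompt):
    # Canned replies standing in for a real model call.
    if "column names" in prompt:
        return '["product", "price"]'
    return '["wallet"]'


queries = expand_queries("What is the average price for wallets?", fake_llm)
print(queries)  # {'schema': ['product', 'price'], 'cell': ['wallet']}
```

Keeping the two query lists separate matters because they are matched against different databases: schema queries against column names, cell queries against cell values.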

Schema Retrieval

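After the queries are generated, schema retrieval uses a pretrained encoder f_enc to fetch the relevant column names. The encoder matches each query against the encoded column names to determine relevance. The retrieved schema data includes the column name, the data type, and example values. For columns identified as numeric or datetime, the minimum and maximum values are shown as examples; for categorical columns, the three most frequent categories are shown. The retrieved schema thus gives a structured overview of the table's format and content, which is used for more targeted data extraction.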
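The schema entries described above could be assembled along these lines. This is a sketch in plain Python and not the paper's code: the type detection is deliberately simple, the `dtype` labels are my own, and `describe_column` is a hypothetical helper.

```python
from collections import Counter
from datetime import date
from numbers import Number


def describe_column(name, values):
    """Summarize one column: min/max if numeric or datetime, else top-3 values."""
    numeric = all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    temporal = all(isinstance(v, date) for v in values)  # datetime subclasses date
    if values and (numeric or temporal):
        kind = "numeric" if numeric else "datetime"
        return {"column_name": name, "dtype": kind,
                "min": min(values), "max": max(values)}
    # Categorical column: keep the three most frequent values as examples.
    examples = [value for value, _ in Counter(values).most_common(3)]
    return {"column_name": name, "dtype": "categorical", "cell_examples": examples}


item_totals = ["$449.00"] * 4 + ["$399.00"] * 3 + ["$549.00"] * 2 + ["$299.00"]
print(describe_column("quantity", [1, 4, 2, 1]))
print(describe_column("item_total", item_totals))
```

Each entry stays a fixed, small size no matter how many rows the table has, which is what keeps the prompt short.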

The relevant prompt is as follows:

========================================= Prompt =========================================
You are working with a pandas dataframe regarding "amazon seller order status prediction
orders data" in Python. The name of the dataframe is `df`. Your task is to use
`python_repl_ast` to answer the question: "What is the average price for leather wallets?"
Tool description:
- `python_repl_ast`: A Python interactive shell. Use this to execute python commands. Input
should be a valid single line python command.
Since you cannot view the table directly, here are some schemas and cell values retrieved
from the table.
Schema Retrieval Results:
{"column_name": "item_total", "dtype": "object", "cell_examples": ['$449.00', '$399.00',
'$549.00']}
{"column_name": "quantity", "dtype": "int64", "min": 1, "max": 4}
{"column_name": "order_no", "dtype": "object", "cell_examples": ['402-4845680-8041921',
'405-9763961-5211537', '404-3964908-7850720']}
...
Cell Retrieval Queries: leather wallets, average price, order status, prediction, amazon
seller
Cell Retrieval Results:
{"column_name": "order_status", "cell_value": "Delivered to buyer"}
{"column_name": "description", "cell_value": "Pure Leather Camel Color Gent’s Wallet with
Coin Compartment and Card Holders | Men’s Ultra Slim Money Organiser (1 pc)"}
...
Strictly follow the given format to respond:
Thought: you should always think about what to do
Action: the single line Python command to execute
Observation: the result of the action
... (this Thought/Action/Observation can repeat N times)
Thought: before giving the final answer, you should think about the observations
Final Answer: the final answer to the original input question (Answer1, Answer2, ...)
Notes:
- Do not use markdown or any other formatting in your responses.
- Ensure the last line is only "Final Answer: Answer1, Answer2, ..." form, no other form.
- Directly output the Final Answer rather than outputting by Python.
- Ensure to have a concluding thought that verifies the table, observations and the
question before giving the final answer.
Now, given a table regarding "amazon seller order status prediction orders data", please
use `python_repl_ast` with the column names and cell values above to answer the question:
"What is the average price for leather wallets?"
Begin!
======================================= Completion =======================================
Thought: I need to remove the comma in the item_total values before converting them to
float.
Action: df['item_total'] = df['item_total'].str.replace('$','').str.replace(',','').astype(float)
Observation: success!
Thought: Now, I can filter out the rows with leather wallets and calculate the average
price.
Action: df[df['description'].str.contains('wallet', case=False)]['item_total'].mean()
Observation: 442.7916666666667
Thought: before giving the final answer, I should consider the observations and ensure that
the calculations are accurate.
Final Answer: The average price for leather wallets is $442.79.
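
Because the prompt fixes a strict Thought/Action/Observation format, a caller can pull the actions and the final answer out of a completion with a couple of regular expressions. The sketch below is an assumption about how such a transcript might be post-processed, not something described in the paper; `parse_completion` is a hypothetical helper, and the sample transcript is shortened from the completion above.

```python
import re


def parse_completion(text):
    """Extract the Action commands and the Final Answer line from a transcript."""
    actions = re.findall(r"^Action:\s*(.+)$", text, flags=re.MULTILINE)
    match = re.search(r"^Final Answer:\s*(.+)$", text, flags=re.MULTILINE)
    final_answer = match.group(1).strip() if match else None
    return actions, final_answer


completion = """Thought: Filter the wallet rows and average the price.
Action: df[df['description'].str.contains('wallet', case=False)]['item_total'].mean()
Observation: 442.7916666666667
Thought: The observation answers the question.
Final Answer: The average price for leather wallets is $442.79."""

actions, final_answer = parse_completion(completion)
print(actions)
print(final_answer)
```

In an agent loop, each extracted action would be executed in the Python shell and its result appended as the next Observation until a Final Answer line appears.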