RAG 2.0 and Google's RICHES: Retrieving, Thinking, and Generating at the Same Time

Article link: https://blog.csdn.net/m0_59235245/article/details/141504724

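RAG 2.0, introduced by contextual.ai, pretrains, fine-tunes, and aligns all components as a single integrated system, backpropagating through both the language model and the retriever to maximize performance. It targets a known weakness of conventional RAG: each component works well on its own, but the system as a whole is far from optimal.

Google DeepMind proposes a new approach, RICHES (Retrieval Interlaced with Sequence Generation), which natively interleaves text generation with document retrieval using a single LLM and a single decoding process. No separate retriever and generator are needed: the model directly decodes document contents or related natural-language retrieval keys. It adapts to diverse new tasks through prompting alone, with no additional training.

Example RICHES output for a multi-hop query, produced with a single LLM and a single decoding pass. The quoted text shown in green is "retrieved", that is, generated verbatim from the retrieval corpus. RICHES generation natively interleaves thoughts with multiple pieces of retrieved evidence.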

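The RICHES workflow, TL;DR version:

Initialize the model: choose a suitable pretrained large language model (LLM).
Define retrieval keys: decide which document identifiers to retrieve by, such as titles, paragraphs, sentences, or propositions.
Build the index: index the corpus with a technique such as FM-Index so that constraint lookups are efficient.
Receive input: take the user's question or query as input.
Interleave generation: the LLM alternates between free-text generation and constrained retrieval-key generation.
Apply constraints: during generation, use the index to constrain retrieval keys so that each one corresponds to valid content in the corpus.
Retrieve documents: use the generated retrieval keys to look up the relevant documents or passages in the corpus.
Integrate and output: combine the retrieved content with the generated text into a complete answer or solution.
Evaluate: score the output with appropriate metrics (such as F1 or AutoAIS).
Iterate: refine the model and pipeline based on the evaluation results.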
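
The evaluation step above mentions F1. As a rough illustration, here is a minimal sketch of token-level F1 between a predicted answer and a reference answer, as commonly used in question answering. The function name `token_f1` and the simple lowercase-and-split normalization are assumptions made for this example, not the exact scoring script used by the RICHES authors.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two answer strings (whitespace tokens, lowercased)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("1986 to 2013", "1986 to 2013"))
print(token_f1("Gillette Stadium", "Wembley"))
```

An answer with one extra token, such as "from 1986 to 2013" against "1986 to 2013", has perfect recall but precision 3/4, so its F1 falls below 1.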

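How RICHES works in detail:

Interleaving retrieval and generation:

RICHES retrieves documents by directly decoding document contents or related natural-language retrieval keys, which point back to the documents they were generated from.
This lets text generation and retrieval be interleaved within a single decoding process, so no separate retriever and generator are needed.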

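Definition of a retrieval key:

A retrieval key is a token sequence drawn from a predefined finite set of sequences K, where each entry is associated with one or more documents in the underlying corpus C.
The special tokens « and » mark the start and end of a retrieval key in the output sequence y.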
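
To make the role of the « and » markers concrete, here is a minimal sketch that pulls retrieval keys out of a generated sequence and maps each one to its source documents. The dictionary `KEY_TO_DOCS`, the document ids, and the helper names are hypothetical, introduced only for this example.

```python
import re

# Hypothetical key set K: each retrieval key maps to the ids of its source documents.
KEY_TO_DOCS = {
    "Alex Ferguson recruited Cristiano Ronaldo": ["doc_ronaldo"],
    "Sir Alex Ferguson managed Manchester United from 1986 to 2013.": ["doc_ferguson"],
}

# A retrieval key is whatever sits between « and », ignoring surrounding whitespace.
KEY_PATTERN = re.compile(r"«\s*(.*?)\s*»", re.DOTALL)


def extract_keys(output):
    """Return the retrieval keys marked in a generated sequence, in order."""
    return KEY_PATTERN.findall(output)


def resolve(output):
    """Pair each key with the documents it points to (empty list if unknown)."""
    return [(key, KEY_TO_DOCS.get(key, [])) for key in extract_keys(output)]


generated = (
    "keyword: Cristiano Ronaldo's recruiting manager "
    "« Alex Ferguson recruited Cristiano Ronaldo » "
    "keyword: Sir Alex Ferguson's tenure at Manchester United "
    "« Sir Alex Ferguson managed Manchester United from 1986 to 2013. » "
    "answer: 1986 to 2013"
)
for key, docs in resolve(generated):
    print(docs, key)
```

Everything outside the markers is free text (thoughts, keywords, the final answer), while everything inside them is expected to be a member of K.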

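Updating the probability model:

Introducing an indicator function 1_K(q) extends the standard autoregressive language-modeling probability P_θ(y|x) into a model that accounts for retrieval keys.
Constrained decoding is implemented by setting the probability of any disallowed continuation to zero.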
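
Written out, the constrained model can be sketched as follows. The symbols Q(y) (the set of retrieval-key spans marked inside y) and Z (a normalizing constant) are notation assumed here for illustration; the indicator function and P_θ are as described above.

```latex
P_\theta(y \mid x, K)
  \;=\; \frac{1}{Z} \prod_{q \in Q(y)} \mathbb{1}_K(q)
        \prod_{i=1}^{n} P_\theta(y_i \mid y_{<i}, x),
\qquad
\mathbb{1}_K(q) =
  \begin{cases}
    1 & \text{if } q \in K \\
    0 & \text{otherwise}
  \end{cases}
```

Any output containing a key q outside K therefore receives probability zero. In practice this is enforced token by token, by masking every continuation that can no longer be extended into a valid key.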

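Constrained Beam Decoding:

Beam search is used as the decoding strategy, approximating a heuristic best-first search.
At each time step, the LLM estimates the value of each node (token), and the best candidates are added to a fixed-size queue (the beam).
Visualization of the constrained beam for the query "When did Marathon change its name to Snickers?" The final RICHES output is "Marathon was renamed Snickers in 1990". Bold boxes trace the progress of the top beam sequence. Grey crossed-out boxes are sequences that the LLM preferred but that the corpus constraints blocked.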
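
The following is a minimal sketch of constrained beam search over a toy example. It assumes whitespace tokens, a hand-written next-token table `TOY_LM` standing in for a real LLM, and a plain Python set of key prefixes standing in for the corpus index. All names and probabilities are invented for illustration. The toy model deliberately prefers the wrong year "1988", which is not in the corpus, to show a preferred continuation being blocked.

```python
import math

# Hypothetical retrieval keys (the "corpus"), tokenized on whitespace.
KEYS = [
    "Marathon was renamed Snickers in 1990",
    "Marathon is a long-distance race",
]
ALLOWED = {tuple(k.split()) for k in KEYS}
# Every non-empty prefix of every key; anything outside this set is a dead end.
PREFIXES = {key[:i] for key in ALLOWED for i in range(1, len(key) + 1)}

# Toy stand-in for an LLM: next-token probabilities given the previous token.
TOY_LM = {
    None: {"Marathon": 0.9, "Snickers": 0.1},
    "Marathon": {"was": 0.8, "is": 0.2},
    "was": {"renamed": 1.0},
    "renamed": {"Snickers": 1.0},
    "Snickers": {"in": 1.0},
    "in": {"1988": 0.6, "1990": 0.4},
    "is": {"a": 1.0},
    "a": {"long-distance": 1.0},
    "long-distance": {"race": 1.0},
}


def constrained_beam_search(beam_size=2, max_len=10):
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            prev = seq[-1] if seq else None
            for token, prob in TOY_LM.get(prev, {}).items():
                extended = seq + (token,)
                if extended not in PREFIXES:
                    continue  # blocked: no key in the corpus starts this way
                candidates.append((extended, score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            # Complete keys leave the beam; partial ones keep expanding.
            (finished if cand[0] in ALLOWED else beams).append(cand)
        if not beams:
            break
    return max(finished, key=lambda c: c[1]) if finished else None


best = constrained_beam_search()
print(" ".join(best[0]))  # prints "Marathon was renamed Snickers in 1990"
```

Without the `PREFIXES` check, the highest-probability path would end in "1988". With it, that branch is dropped and the beam settles on the sequence that actually exists in the corpus.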

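Efficient constraints with FM-Index:

An FM-Index (Ferragina and Manzini, 2000) constrains the model's output during decoding, ensuring that each constrained output sequence actually occurs in the corpus.
The FM-Index is a compressed full-text index built on the Burrows-Wheeler transform; it behaves like a compressed suffix array and supports fast substring search.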
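
To show what "fast substring search" means here, below is a minimal, uncompressed FM-Index sketch with backward search. It indexes characters rather than LLM tokens, builds the suffix array naively, and stores full occurrence tables, so it is a teaching toy and not how a production index is built. The function names and the sample text are assumptions made for this example.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-Index (C table, occurrence table, length) for text."""
    text += "\0"  # unique sentinel that sorts before every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = Counter(text)
    # first[c] = number of characters in text that are strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, bwt_ch in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == bwt_ch else 0)
    return first, occ, len(bwt)


def count(index, pattern):
    """Number of occurrences of pattern, found by backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching rows in the sorted suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def is_valid_continuation(index, prefix, continuation):
    """True if prefix followed by continuation still occurs in the corpus."""
    return count(index, prefix + continuation) > 0


TEXT = "Joker is played by Heath Ledger"
INDEX = build_fm_index(TEXT)
print(is_valid_continuation(INDEX, "Joker is played by ", "Heath"))
print(is_valid_continuation(INDEX, "Joker is played by ", "Nolan"))
```

During decoding, a check like `is_valid_continuation` is what decides which next tokens keep a non-zero probability: a continuation that never follows the current prefix anywhere in the corpus is masked out.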

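Adaptive beam size:

An adaptive decoding strategy adjusts the beam size dynamically, because constrained and unconstrained sequences have different needs.
A constrained sequence must match a target retrieval key exactly, while an unconstrained sequence has more freedom.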
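
One way to picture this is a decoder that checks whether it is currently inside a « ... » span and picks its beam width accordingly. The sketch below assumes a wide beam for constrained spans and a narrow one for free text; the default values 10 and 1 and the function names are illustrative assumptions, not settings taken from the paper.

```python
KEY_START, KEY_END = "«", "»"


def in_key_span(tokens):
    """True if the token sequence ends inside an unclosed « ... » span."""
    inside = False
    for token in tokens:
        if token == KEY_START:
            inside = True
        elif token == KEY_END:
            inside = False
    return inside


def beam_size_for(tokens, constrained_beam=10, free_beam=1):
    """Pick a beam width for the next decoding step (values are assumptions)."""
    return constrained_beam if in_key_span(tokens) else free_beam


print(beam_size_for(["keyword:", "NFL", "revenues"]))
print(beam_size_for(["keyword:", "NFL", "revenues", "«", "NFL"]))
print(beam_size_for(["«", "NFL", "revenues", "»", "answer:"]))
```

The intuition is that an exact match needs room to recover when the most likely prefix turns out to have no valid completion in the corpus, whereas free text rarely benefits from the extra candidates.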

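Indexing strategies:

An FM-Index can efficiently index every substring of the corpus, but the choice of document representation has a large effect on retrieval quality.
RICHES supports several indexing strategies: document titles, paragraph substrings, sentence substrings, and propositions.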
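
As a rough illustration of how the choice of unit changes the key set, the sketch below builds retrieval keys from a tiny corpus at title, paragraph, and sentence granularity. The corpus `DOCS`, the regex-based sentence splitter, and the function names are assumptions made for this example. Proposition indexing is not sketched, since it requires rewriting text into self-contained statements first.

```python
import re

# Hypothetical two-document corpus.
DOCS = [
    {"title": "Snickers",
     "text": "Snickers is a chocolate bar. It was sold as Marathon in the UK until 1990."},
    {"title": "Gillette Stadium",
     "text": "Gillette Stadium is the home of the New England Patriots."},
]


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_keys(docs, strategy):
    """Map each retrieval key to the titles of the documents it came from."""
    key_to_docs = {}
    for doc in docs:
        if strategy == "title":
            units = [doc["title"]]
        elif strategy == "paragraph":
            units = [p.strip() for p in doc["text"].split("\n\n") if p.strip()]
        elif strategy == "sentence":
            units = split_sentences(doc["text"])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for unit in units:
            key_to_docs.setdefault(unit, []).append(doc["title"])
    return key_to_docs


for strategy in ("title", "paragraph", "sentence"):
    print(strategy, len(build_keys(DOCS, strategy)))
```

Note that the sentence "It was sold as Marathon in the UK until 1990." loses its subject once it is split off from its paragraph. A proposition rewrites such a sentence so that it stands alone, which avoids this loss of context.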

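RICHES performs strongly on open-domain question answering tasks (attributed QA, multi-hop QA, and retrieval interleaved with thoughts). Compared with conventional retrieval-augmented generation, it does especially well on multi-hop QA (Hotpot), producing more accurate answers within a single decoding process.

Overall performance comparison for RICHES. For the dense retrievers, the top-k documents are retrieved and fed to a few-shot Answerer, with k=1 for GTR passage and k=2 for GTR proposition. For iterative retrieval, up to 4 documents are retrieved, with k=1 at each step.

Example comparison of RICHES and dense retrieval on single-hop question answering (QA). Only the retrieved text is shown, for illustration.

Example of iterative retrieval output from RICHES. Remarks are annotated in the form (# Comments).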

The paper also compares RICHES across indexing strategies and beam search sizes:

Proposition retrieval keys work best.

Appendix:

Few-shot prompt template for RICHES on multi-hop QA

For given input query, write 1-3 passages to answer the query. Write a hint keyword and a passage contained within « and ». A passage must be a complete sentence and not a phrase. It must contain complete context for answering the query and should not begin with it, he, they etc. Do not repeat any passages. Aim for new keywords.

question: The football manager who recruited Cristiano Ronaldo managed Manchester United during what timeframe?
passage: keyword: Cristiano Ronaldo’s recruiting manager « Alex Ferguson recruited Cristiano Ronaldo » keyword: Sir Alex Ferguson’s tenure at Manchester United « Sir Alex Ferguson managed Manchester United from 1986 to 2013. »
answer: 1986 to 2013

question: Were Eatza Pizza and Your Pie founded in the same state?
passage: keyword: Eatza Pizza founded in state « Eatza Pizza was founded in Arizona » keyword: Your Pie founded in state « Your Pie was founded in Athens, Georgia »
answer: no

question: In which stadium do the teams owned by Myra Kraft’s husband play?
passage: keyword: Myra Kraft’s husband « Robert Kraft’s wife is Myra Kraft. » keyword: Robert Kraft’s team « Robert Kraft is the owner of the New England Patriots. » keyword: New England Patriots stadium « Gillette Stadium is the home of the New England Patriots. »
answer: Gillette Stadium

question: <question>
passage:

Few-shot prompt template for RICHES on single-hop QA

For given input query, write 1-3 passages to answer the query. Write a hint keyword and a passage contained within « and ». A passage must be a complete sentence and not a phrase. It must contain complete context for answering the query and should not begin with it, he, they etc. Do not repeat any passages. Aim for new keywords.

question: who is the owner of phoenix mall pune?
passage: keyword: Phoenix Market City owner « Phoenix Market City is developed by Phoenix Mills Limited. »
answer: Phoenix Mills Limited

question: what brings in more money nba or nfl?
passage: keyword: NFL revenues « NFL revenues are well over $10 billion per season. » keyword: NBA revenue « NBA amasses about $6 billion annually. »
answer: NFL

question: when was the french national anthem adopted?
passage: keyword: French national anthem « La Marseillaise became the national anthem of France. » keyword: La Marseillaise adoption « La Marseillaise was adopted by France in 1795. »
answer: 1795

question: <question>
passage:

Few-shot prompt template for extracting answers from propositions

Answer the ’question’ only based on the given ’passage’. If the ’passage’ lacks context or is not relevant, say ’Cannot answer’ else say generate a short answer. Do not answer the query from outside the scope of the passage.

question: what brings in more money nba or nfl?
passage: NFL revenues are well over $10 billion per season. NBA amasses about $6 billion annually.
answer: NFL

question: when did they put warnings on cigarette packs
passage: Tobacco packaging 1978’s warning was not removed, so now every cigarette pack contains both warnings (one on each lateral).
answer: Cannot Answer

question: when was the french national anthem adopted?
passage: La Marseillaise became the national anthem of France. La Marseillaise was adopted by France in 1795.
answer: 1795

question: <question>
passage: <passage>
answer:

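Illustration of the constrained decoding process. Given the prefix "Joker is played by", the continuation "Nolan" is not found in the corpus and is therefore masked out.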

Reference: From RAG to RICHES: Retrieval Interlaced with Sequence Generation, Google DeepMind. https://arxiv.org/pdf/2407.00361
