Advanced RAG 06: Exploring Query Rewriting for RAG

静愚 AGI

Published 2024-08-07 08:00:00

Article link: https://blog.csdn.net/JingYu_365/article/details/140938221

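Abstract

This article describes several query rewriting techniques used in retrieval-augmented generation (RAG) to improve the semantic match between queries and documents.

It first introduces Hypothetical Document Embeddings (HyDE), which generates hypothetical documents and combines them with the original query so that the query vector lines up better with real documents.

It then presents the Rewrite-Retrieve-Read framework, which rewrites the query before retrieval to improve both information retrieval and answer generation.

It also covers Step-Back Prompting, which poses a more abstract question to help a large language model reason more accurately, and Query2doc, which expands the query with pseudo-documents generated by an LLM.

Finally, ITER-RETGEN improves query and answer quality by iterating retrieval and generation steps. The article also provides code examples and implementation details, and discusses the strengths and limitations of these methods.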

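Key points

Query rewriting is essential for improving the semantic consistency between queries and documents in RAG.
HyDE bridges the semantic gap between the original query and the document store by generating hypothetical documents.
The Rewrite-Retrieve-Read framework highlights the role of query rewriting in improving retrieval and answer generation.
Step-Back Prompting abstracts the question to help the model avoid errors in intermediate reasoning steps.
Query2doc expands the query with generated pseudo-documents to improve retrieval precision.
ITER-RETGEN refines the query and the answer through repeated rounds of retrieval and generation.
These techniques improve query handling, but they also bring performance trade-offs and practical challenges.
Besides query rewriting, there are other pre-retrieval methods, such as query routing and query decomposition, which are likely to be explored and applied further.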

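In retrieval-augmented generation (RAG), the user's original query is often problematic: it may be poorly worded or lack semantic information. For example, if a query such as "The NBA champion of 2020 is the Los Angeles Lakers! Tell me what is langchain framework?" is searched as-is, the LLM may give a wrong answer or be unable to answer at all.

Therefore, **the semantic space of the user query must be aligned with the semantic space of the documents.** Query rewriting addresses this problem effectively. Its role in RAG is shown in Figure 1:

Figure 1: Query rewriting in RAG (red dashed box).

In terms of position, query rewriting is a pre-retrieval method. Note that the figure only roughly indicates where query rewriting sits in RAG; some of the algorithms discussed below refine this process.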

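Query rewriting is a key technique for aligning the semantics of queries and documents. For example:

Hypothetical Document Embeddings (HyDE): aligns the semantic spaces of queries and documents through hypothetical documents.
Rewrite-Retrieve-Read: proposes a framework that departs from the traditional retrieve-then-read order and focuses on query rewriting.
Step-Back Prompting: lets the LLM reason and retrieve at the level of high-level concepts.
Query2Doc: creates pseudo-documents by prompting an LLM with a few examples, then merges these pseudo-documents with the original query to build a new query.
ITER-RETGEN: combines the previously generated output with the query, retrieves relevant documents, and generates a new output. This process is repeated several times to obtain the final result.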
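
The ITER-RETGEN loop described in the last item can be sketched as follows. This is a minimal illustration, not the paper's implementation: `iter_retgen`, `toy_retrieve`, and `toy_generate` are hypothetical names, the retriever and generator are stand-ins for a real search backend and an LLM, and joining the previous output and the question with a space is an assumption of this sketch.

```python
from typing import Callable, List


def iter_retgen(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    iterations: int = 2,
) -> str:
    """Alternate retrieval and generation for a fixed number of rounds."""
    output = ""
    for _ in range(iterations):
        # Round 1 searches with the question alone; later rounds prepend
        # the previous output so retrieval can use what was generated.
        search_query = f"{output} {question}".strip()
        documents = retrieve(search_query)
        output = generate(question, documents)
    return output


# Toy stand-ins (hypothetical) so the loop runs without an LLM.
seen_queries: List[str] = []


def toy_retrieve(search_query: str) -> List[str]:
    seen_queries.append(search_query)
    return [f"doc for: {search_query}"]


def toy_generate(question: str, documents: List[str]) -> str:
    return f"draft {len(seen_queries)}"


final_answer = iter_retgen("what is HyDE?", toy_retrieve, toy_generate, iterations=2)
```

After the call, `seen_queries` holds one search query per round, which makes it easy to see how the second query is enriched by the first draft.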

Let's look at each of these methods in detail.

Hypothetical Document Embeddings (HyDE)

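The paper "Precise Zero-Shot Dense Retrieval without Relevance Labels" proposes a method based on Hypothetical Document Embeddings (HyDE). Its main process is shown in Figure 2.

Figure 2: Illustration of the HyDE model.

The process has four main steps: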

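1. Use an LLM to generate N hypothetical documents from the query. These generated documents may not be factual and may contain errors, but they should resemble relevant documents. The purpose of this step is to let the LLM interpret the user's query.
2. Feed each generated hypothetical document into the encoder, which maps it to a dense vector f(d_k). The encoder is assumed to act as a filter that removes noise from the hypothetical documents. Here d_k denotes the k-th generated document and f denotes the encoder.
3. Compute the average of the N vectors:

$$v = \frac{1}{N}\sum_{k=1}^{N} f(d_k)$$

The original query q can also be treated as one more hypothesis:

$$v = \frac{1}{N+1}\left[\sum_{k=1}^{N} f(d_k) + f(q)\right]$$

4. Use the vector v to retrieve answers from the document store. As described in step 3, this vector carries information from both the user query and the expected answer pattern, which can improve recall.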
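
Steps 2 to 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for a real dense encoder (it hashes words into 16 buckets), and `hyde_vector`, `cosine`, and `retrieve` are names invented for this sketch, not part of LlamaIndex or LangChain.

```python
import math
from typing import Callable, List, Sequence

DIM = 16  # toy embedding size, chosen only for illustration


def toy_encode(text: str) -> List[float]:
    """Hypothetical stand-in for the encoder f: bucketed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[sum(ord(ch) for ch in token) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def hyde_vector(
    query: str,
    hypothetical_docs: Sequence[str],
    encode: Callable[[str], List[float]] = toy_encode,
    include_query: bool = True,
) -> List[float]:
    """Average f(d_1), ..., f(d_N), plus f(q) if requested, into one vector v."""
    texts = list(hypothetical_docs) + ([query] if include_query else [])
    if not texts:
        raise ValueError("need at least one text to embed")
    vectors = [encode(t) for t in texts]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    v: Sequence[float],
    corpus: Sequence[str],
    encode: Callable[[str], List[float]] = toy_encode,
    top_k: int = 1,
) -> List[str]:
    """Step 4: rank real documents by cosine similarity to v."""
    ranked = sorted(corpus, key=lambda doc: cosine(v, encode(doc)), reverse=True)
    return list(ranked[:top_k])


corpus = ["paul graham founded viaweb", "the lakers won the title"]
v = hyde_vector("graham viaweb", ["paul graham founded viaweb"])
best_match = retrieve(v, corpus)
```

Setting `include_query=False` gives the first formula (hypothetical documents only); the default corresponds to the second formula, where the query is averaged in as one more hypothesis.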

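Figure 3 shows my understanding of HyDE. The goal of HyDE is to generate hypothetical documents so that the final query vector v lies as close as possible to the real documents in the vector space.

Figure 3: My understanding of HyDE.

Both LlamaIndex and LangChain implement HyDE. The following uses LlamaIndex as the example.

Place this file in YOUR_DIR_PATH. The test code is as follows (the LlamaIndex version installed here is 0.10.12):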

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Load documents, build the VectorStoreIndex
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)


query_str = "what did paul graham do after going to RISD"

# Query without transformation: The same query string is used for embedding lookup and also summarization.
query_engine = index.as_query_engine()
response = query_engine.query(query_str)

print('-' * 100)
print("Base query:")
print(response)


# Query with HyDE transformation
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)

print('-' * 100)
print("After HyDEQueryTransform:")
print(response)

First, look at the default HyDE prompt in LlamaIndex:

############################################
# HYDE
##############################################

HYDE_TMPL = (
    "Please write a passage to answer the question\n"
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'Passage:"""\n'
)

DEFAULT_HYDE_PROMPT = PromptTemplate(HYDE_TMPL, prompt_type=PromptType.SUMMARY)

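The code of the HyDEQueryTransform class is shown below.

The purpose of the _run method is to generate the hypothetical document. Three debug statements have been added to _run to monitor the content of the hypothetical document: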

class HyDEQueryTransform(BaseQueryTransform):
    """Hypothetical Document Embeddings (HyDE) query transform.

    It uses an LLM to generate hypothetical answer(s) to a given query,
    and use the resulting documents as embedding strings.

    As described in `[Precise Zero-Shot Dense Retrieval without Relevance Labels]
    (https://arxiv.org/abs/2212.10496)`
    """

    def __init__(
        self,
        llm: Optional[LLMPredictorType] = None,
        hyde_prompt: Optional[BasePromptTemplate] = None,
        include_original: bool = True,
    ) -> None:
        """Initialize HyDEQueryTransform.

        Args:
            llm_predictor (Optional[LLM]): LLM for generating
                hypothetical documents
            hyde_prompt (Optional[BasePromptTemplate]): Custom prompt for HyDE
            include_original (bool): Whether to include original query
                string as one of the embedding strings
        """
        super().__init__()

        self._llm = llm or Settings.llm
        self._hyde_prompt = hyde_prompt or DEFAULT_HYDE_PROMPT
        self._include_original = include_original

    def _get_prompts(self) -> PromptDictType:
        """Get prompts."""
        return {"hyde_prompt": self._hyde_prompt}

    def _update_prompts(self, prompts: PromptDictType) -> None:
        """Update prompts."""
        if "hyde_prompt" in prompts:
            self._hyde_prompt = prompts["hyde_prompt"]

    def _run(self, query_bundle: QueryBundle, metadata: Dict) -> QueryBundle:
        """Run query transform."""
        # TODO: support generating multiple hypothetical docs
        query_str = query_bundle.query_str
        hypothetical_doc = self._llm.predict(self._hyde_prompt, context_str=query_str)
        embedding_strs = [hypothetical_doc]
        if self._include_original:
            embedding_strs.extend(query_bundle.embedding_strs)

        # The following three lines contain the added debug statements.
        print('-' * 100)
        print("Hypothetical doc:")
        print(embedding_strs)

        return QueryBundle(
            query_str=query_str,
            custom_embedding_strs=embedding_strs,
        )

Running the test code produces the following:

(llamaindex_010) Florian:~ Florian$ python /Users/Florian/Documents/test_hyde.py 
-----------------------------------------------------------------------------------------
Base query:
Paul Graham resumed his old life in New York after attending RISD. He became rich and continued his old patterns, but with new opportunities such as being able to easily hail taxis and dine at charming restaurants. He also started experimenting with a new kind of still life painting technique.
-----------------------------------------------------------------------------------------
Hypothetical doc:
["After attending the Rhode Island School of Design (RISD), Paul Graham went on to co-found Viaweb, an online store builder that was later acquired by Yahoo for $49 million. Following the success of Viaweb, Graham became an influential figure in the tech industry, co-founding the startup accelerator Y Combinator in 2005. Y Combinator has since become one of the most prestigious and successful startup accelerators in the world, helping launch companies like Dropbox, Airbnb, and Reddit. Graham is also known for his prolific writing on technology, startups, and entrepreneurship, with his essays being widely read and respected in the tech community. Overall, Paul Graham's career after RISD has been marked by innovation, success, and a significant impact on the startup ecosystem.", 'what did paul graham do after going to RISD']
------------------------------------------------------------------
After HyDEQueryTransform:
After going to RISD, Paul Graham resumed his old life in New York, but now he was rich. He continued his old patterns but with new opportunities, such as being able to easily hail taxis and dine at charming restaurants. He also started to focus more on his painting, experimenting with a new technique. Additionally, he began looking for an apartment to buy and contemplated the idea of building a web app for making web apps, which eventually led him to start a new company called Aspra.

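embedding_strs is a list with two elements: the first is the generated hypothetical document and the second is the original query. They are combined into one list to make the vector computation easier.

In this example, HyDE accurately imagines what Paul Graham did after RISD (see the hypothetical doc). This improves the embedding and, in turn, the final output.

Of course, HyDE also has failure cases. Interested readers can visit this webpage to test them.

HyDE is unsupervised: no model is trained, and both the generative model and the contrastive encoder remain unchanged.

In summary, although HyDE introduces a new approach to query rewriting, it has limitations. It does not rely on query-to-document embedding similarity; instead, it emphasizes the similarity between one document and another. If the language model is not knowledgeable about the topic, however, the hypothetical document may not help and can lead to more errors.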

Rewrite-Retrieve-Read

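This idea comes from the paper "Query Rewriting for Retrieval-Augmented Large Language Models". The paper argues that the original query, especially in real-world scenarios, is not always the best input for retrieval.

The paper therefore proposes first rewriting the query with an LLM and then retrieving and generating the answer, rather than retrieving and generating directly from the original query, as shown in Figure 4 (b).

Figure 4: The Rewrite-Retrieve-Read pipeline, panel (b).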
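
The difference between the two orders can be sketched with plain functions. This is a minimal illustration under stated assumptions: the rewriter, retriever, and reader below are hypothetical toy stand-ins (`toy_rewrite`, `toy_retrieve`, `toy_read`) for an LLM rewriter, a search engine, and an LLM reader, and `DOCS` is a two-entry toy corpus.

```python
from typing import Callable

Retriever = Callable[[str], str]
Reader = Callable[[str, str], str]


def retrieve_then_read(query: str, retrieve: Retriever, read: Reader) -> str:
    """Baseline: search with the raw user query."""
    return read(query, retrieve(query))


def rewrite_retrieve_read(
    query: str, rewrite: Callable[[str], str], retrieve: Retriever, read: Reader
) -> str:
    """Rewrite first, search with the rewritten query, answer the original question."""
    search_query = rewrite(query)
    context = retrieve(search_query)
    return read(query, context)


# Toy stand-ins (hypothetical) so the two pipelines can be compared offline.
DOCS = {
    "nba": "The Los Angeles Lakers won the 2020 NBA championship.",
    "langchain": "LangChain is a framework for building applications with language models.",
}


def toy_rewrite(query: str) -> str:
    # Keep only the part after the last exclamation mark.
    return query.split("!")[-1].strip()


def toy_retrieve(search_query: str) -> str:
    # Return the first passage whose keyword occurs in the search query.
    lowered = search_query.lower()
    for keyword, passage in DOCS.items():
        if keyword in lowered:
            return passage
    return ""


def toy_read(question: str, context: str) -> str:
    return context or "No relevant context found."


query = "The NBA champion of 2020 is the Los Angeles Lakers! Tell me what is langchain framework?"
baseline = retrieve_then_read(query, toy_retrieve, toy_read)
rewritten = rewrite_retrieve_read(query, toy_rewrite, toy_retrieve, toy_read)
```

With these stand-ins, the baseline is distracted by the NBA sentence, while the rewritten query reaches the LangChain passage. The reader still receives the original question in both cases.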

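To illustrate how query rewriting affects context retrieval and prediction quality, consider this example: the query "The NBA champion of 2020 is the Los Angeles Lakers! Tell me what is langchain framework?" is handled correctly once it has been rewritten.

The example is implemented with LangChain. Install the required libraries as follows: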

pip install langchain
pip install openai
pip install langchainhub
pip install duckduckgo-search
pip install langchain_openai

Environment configuration and library imports:

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

Build the chain and run a simple query:

def june_print(msg, res):
    print('-' * 100)
    print(msg)
    print(res)


base_template = """Answer the users question based only on the following context:

<context>
{context}
</context>

Question: {question}
"""

base_prompt = ChatPromptTemplate.from_template(base_template)

model = ChatOpenAI(temperature=0)

search = DuckDuckGoSearchAPIWrapper()


def retriever(query):
    return search.run(query)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | base_prompt
    | model
    | StrOutputParser()
)

query = "The NBA champion of 2020 is the Los Angeles Lakers! Tell me what is langchain framework?"

june_print(
    'The result of query:', 
    chain.invoke(query)
)

june_print(
    'The result of the searched contexts:', 
    retriever(query)
)

The output is as follows:

(langchain) Florian:~ Florian$ python /Users/Florian/Documents/test_rewrite_retrieve_read.py 
----------------------------------------------------------------------------------------------------
The result of query:
I'm sorry, but the context provided does not mention anything about the langchain framework.
----------------------------------------------------------------------------------------------------
The result of the searched contexts:
The Los Angeles Lakers are the 2020 NBA Champions!Watch their championship celebration here!Subscribe to the NBA: https://on.nba.com/2JX5gSN Full Game Highli... Aug 4, 2023. The 2020 Los Angeles Lakers were truly one of the most complete teams over the decade. LeBron James' fourth championship was one of the biggest moments of his career. Only two players from the 2020 team remain on the Lakers. In the storied history of the NBA, few teams have captured the imagination of fans and left a lasting ... James had 28 points, 14 rebounds and 10 assists, and the Lakers beat the Miami Heat 106-93 on Sunday night to win the NBA finals in six games. James was also named Most Valuable Player of the NBA ... Portland Trail Blazers star Damian Lillard recently spoke about the 2020 NBA "bubble" playoffs and had an interesting perspective on the criticism the eventual winners, the Los Angeles Lakers, faced. But perhaps none were more surprising than Adebayo's opinion on the 2020 NBA Finals. The Heat were defeated by LeBron James and the Los Angeles Lakers in six games. Miller asked, "Tell me about ...

The results show that the retrieved context contains very little information about "langchain".

Now build a rewriter to rewrite the search query.

rewrite_template = """Provide a better search query for web search engine to answer the given question, end the queries with ’**’. 
Question: {x} 
Answer:
"""
rewrite_prompt = ChatPromptTemplate.from_template(rewrite_template)


def _parse(text):
    return text.strip("**")

rewriter = rewrite_prompt | ChatOpenAI(temperature=0) | StrOutputParser() | _parse
june_print(
    'Rewritten query:', 
    rewriter.invoke({"x": query})
)

The result is as follows:

----------------------------------------------------------------------------------------------------
Rewritten query:
What is langchain framework and how does it work?
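
One detail of `_parse` is worth checking: `str.strip("**")` treats its argument as a set of characters, so it removes any run of `*` at either end but stops at the first other character, including whitespace or a trailing newline. The sketch below shows that behaviour and a slightly more tolerant variant; `parse_rewritten_query` is a hypothetical helper, not part of LangChain.

```python
def parse_rewritten_query(text: str) -> str:
    """Strip surrounding whitespace, asterisk markers, and wrapping quotes."""
    return text.strip().strip("*").strip().strip("\"'").strip()


# The original parser works when the marker is the very last thing in the text ...
clean_case = "What is langchain?**".strip("**")
# ... but leaves the marker in place when whitespace or a newline follows it.
newline_case = "What is langchain? **\n".strip("**")
# The tolerant variant handles both, plus quotes around the query.
tolerant_case = parse_rewritten_query('"What is langchain?" **\n')
```

Whether the extra tolerance is needed depends on how closely the model follows the prompt's instruction to end the query with the marker.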

Build rewrite_retrieve_read_chain, which uses the rewritten query.

rewrite_retrieve_read_chain = (
    {
        "context": {"x": RunnablePassthrough()} | rewriter | retriever,
        "question": RunnablePassthrough(),
    }
    | base_prompt
    | model
    | StrOutputParser()
)

june_print(
    'The result of the rewrite_retrieve_read_chain:', 
    rewrite_retrieve_read_chain.invoke(query)
)

The output is as follows:

-------------------------------------------------------------------------------------------
The result of the rewrite_retrieve_read_chain:
LangChain is a Python framework designed to help build AI applications powered by language models, particularly large language models (LLMs). It provides a generic interface to different foundation models, a framework for managing prompts, and a central interface to long-term memory, external data, other LLMs, and more. It simplifies the process of interacting with LLMs and can be used to build a wide range of applications, including chatbots that interact with users naturally.

By rewriting the query, we obtained the correct answer.

STEP-BACK PROMPTING

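Step-back prompting is a simple prompting technique that enables LLMs to abstract, deriving high-level concepts and first principles from instances that contain specific details. The idea is to define a "step-back question" as a more abstract question derived from the original question.

For example, if a query contains many details, it is hard for an LLM to retrieve the relevant facts needed to solve the task. As the first example in Figure 5 shows, for the physics question "What happens to the pressure P of an ideal gas if the temperature is increased by a factor of 2 and the volume is increased by a factor of 8?", an LLM that reasons about the question directly may deviate from the first principle of the ideal gas law.

Similarly, the question "Which school did Estella Leopold attend between August 1954 and November 1954?" is hard to answer directly because of its specific time-range constraint.

Figure 5: Illustration of step-back prompting with two steps, abstraction and reasoning, guided by concepts and principles.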
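
For the ideal-gas question, stepping back to the underlying principle makes the answer a one-line derivation. The worked example below assumes the amount of gas n stays constant, which the question implies but does not state.

```latex
\begin{align}
PV &= nRT \quad\Longrightarrow\quad P = \frac{nRT}{V} \\
P' &= \frac{nR\,(2T)}{8V} = \frac{1}{4}\cdot\frac{nRT}{V} = \frac{P}{4}
\end{align}
```

So the pressure drops to one quarter of its original value.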

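In both cases, asking a broader question helps the model answer the specific query.

Instead of asking directly "Which school did Estella Leopold attend during a specific period?", we can ask about "Estella Leopold's education history".

This broader topic subsumes the original question and provides all the information needed to infer which school Estella Leopold attended during the specific period. Notably, such broader questions are usually easier to answer than the original specific ones.

Reasoning grounded in these abstractions helps avoid errors in the intermediate chain-of-thought steps, as shown in Figure 5 (left).

In summary, step-back prompting consists of two basic steps:

Abstraction: instead of answering the query directly, we first prompt the LLM to ask a broader question about a high-level concept or principle, and then retrieve relevant facts about that concept or principle.
Reasoning: the LLM derives the answer to the original question from these facts about the high-level concept or principle. This is called abstraction-grounded reasoning.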
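
The two steps can be sketched as a small function. This is an illustration under stated assumptions: `step_back_answer` and the three toy callables are hypothetical stand-ins for an LLM that writes the step-back question, a retriever, and an LLM that reasons over the facts. Retrieving for both the original and the step-back question is a choice made in this sketch.

```python
from typing import Callable, List


def step_back_answer(
    question: str,
    make_step_back: Callable[[str], str],
    retrieve: Callable[[str], str],
    reason: Callable[[str, List[str]], str],
) -> str:
    # Abstraction: derive a broader question and gather facts for it.
    step_back_question = make_step_back(question)
    facts = [retrieve(question), retrieve(step_back_question)]
    # Reasoning: answer the original question using the gathered facts.
    return reason(question, facts)


# Toy stand-ins (hypothetical) so the flow can be traced without an LLM.
retrieval_log: List[str] = []


def toy_step_back(question: str) -> str:
    return "What is Estella Leopold's education history?"


def toy_retrieve(search_query: str) -> str:
    retrieval_log.append(search_query)
    return f"facts about: {search_query}"


def toy_reason(question: str, facts: List[str]) -> str:
    return f"answer to {question!r} using {len(facts)} fact sets"


question = "Which school did Estella Leopold attend between August 1954 and November 1954?"
answer = step_back_answer(question, toy_step_back, toy_retrieve, toy_reason)
```

`retrieval_log` then shows that retrieval ran once with the specific question and once with the broader one.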

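To illustrate how step-back prompting affects context retrieval and prediction quality, here is demo code implemented with LangChain.

Environment configuration and imports: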

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper

Build a chain and run the original query:

def june_print(msg, res):
    print('-' * 100)
    print(msg)
    print(res)


question = "was chatgpt around while trump was president?"

base_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

{normal_context}

Original Question: {question}
Answer:"""

base_prompt = ChatPromptTemplate.from_template(base_prompt_template)

search = DuckDuckGoSearchAPIWrapper(max_results=4)
def retriever(query):
    return search.run(query)

base_chain = (
    {
        # Retrieve context using the original question
        "normal_context": RunnableLambda(lambda x: x["question"]) | retriever,
        # Pass on the question
        "question": lambda x: x["question"],
    }
    | base_prompt
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)


june_print('The searched contexts of the original question:', retriever(question))
june_print('The result of base_chain:', base_chain.invoke({"question": question}) )

The result:

(langchain) Florian:~ Florian$ python /Users/Florian/Documents/test_step_back.py 
----------------------------------------------------------------------------------------------------
The searched contexts of the original question:
While impressive in many respects, ChatGPT also has some major flaws. ... [President's Name]," refused to write a poem about ex-President Trump, but wrote one about President Biden ... The company said GPT-4 recently passed a simulated law school bar exam with a score around the top 10% of test takers. By contrast, the prior version, GPT-3.5, scored around the bottom 10%. The ... These two moments show how Twitter's choices helped former President Trump. ... With ChatGPT, which launched to the public in late November, users can generate essays, stories and song lyrics ... Donald Trump is asked a question—say, whether he regrets his actions on Jan. 6—and he answers with something like this: " Let me tell you, there's nobody who loves this country more than me ...
----------------------------------------------------------------------------------------------------
The result of base_chain:
Yes, ChatGPT was around while Trump was president. ChatGPT is an AI language model developed by OpenAI and was launched to the public in late November. It has the capability to generate essays, stories, and song lyrics. While it may have been used to write a poem about President Biden, it also has the potential to be used in various other contexts, including generating responses from hypothetical scenarios involving former President Trump.

The result is clearly incorrect.

Next, build step_back_question_chain and step_back_chain to obtain the correct result.

# Few Shot Examples
examples = [
    {
        "input": "Could the members of The Police perform lawful arrests?",
        "output": "what can the members of The Police do?",
    },
    {
        "input": "Jan Sindel’s was born in what country?",
        "output": "what is Jan Sindel’s personal history?",
    },
]
# We now transform these to example messages
example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

step_back_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:""",
        ),
        # Few shot examples
        few_shot_prompt,
        # New question
        ("user", "{question}"),
    ]
)
step_back_question_chain = step_back_prompt | ChatOpenAI(temperature=0) | StrOutputParser()
june_print('The step-back question:', step_back_question_chain.invoke({"question": question}))
june_print('The searched contexts of the step-back question:', retriever(step_back_question_chain.invoke({"question": question})) )



response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

{normal_context}
{step_back_context}

Original Question: {question}
Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)


step_back_chain = (
    {
        # Retrieve context using the normal question
        "normal_context": RunnableLambda(lambda x: x["question"]) | retriever,
        # Retrieve context using the step-back question
        "step_back_context": step_back_question_chain | retriever,
        # Pass on the question
        "question": lambda x: x["question"],
    }
    | response_prompt
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

june_print('The result of step_back_chain:', step_back_chain.invoke({"question": question}) )

The result is as follows:

----------------------------------------------------------------------------------------------------
The step-back question:
When did ChatGPT become available?
----------------------------------------------------------------------------------------------------
The searched contexts of the step-back question:
OpenAI released an early demo of ChatGPT on November 30, 2022, and the chatbot quickly went viral on social media as users shared examples of what it could do. Stories and samples included ... March 14, 2023 - Anthropic launched Claude, its ChatGPT alternative. March 20, 2023 - A major ChatGPT outage affects all users for several hours. March 21, 2023 - Google launched Bard, its ... The same basic models had been available on the API for almost a year before ChatGPT came out. In another sense, we made it more aligned with what humans want to do with it. A paid ChatGPT Plus subscription is available. (Image credit: OpenAI) ChatGPT is based on a language model from the GPT-3.5 series, which OpenAI says finished its training in early 2022.
----------------------------------------------------------------------------------------------------
The result of step_back_chain:
No, ChatGPT was not around while Trump was president. ChatGPT was released to the public in late November, after Trump's presidency had ended. The references to ChatGPT in the context provided are all dated after Trump's presidency, such as the release of an early demo on November 30, 2022, and the launch of ChatGPT Plus subscription. Therefore, it is safe to say that ChatGPT was not around during Trump's presidency.
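
The base chain went wrong because its context mentioned only "late November" without a year. Once the step-back context supplies the full date, the reasoning step reduces to a date comparison. The sketch below makes that explicit; the end date of the presidential term is general knowledge assumed here, not something stated in the retrieved context, and `existed_before` is a hypothetical helper.

```python
from datetime import date

# Assumption (general knowledge, not from the retrieved context):
# the presidential term in question ended on 20 January 2021.
TERM_END = date(2021, 1, 20)

# From the step-back context: an early demo was released on November 30, 2022.
CHATGPT_RELEASE = date(2022, 11, 30)


def existed_before(release: date, period_end: date) -> bool:
    """True if something released on `release` was available before `period_end`."""
    return release <= period_end


was_around = existed_before(CHATGPT_RELEASE, TERM_END)
gap_in_days = (CHATGPT_RELEASE - TERM_END).days
```

`was_around` is False, matching the step-back chain's answer, and `gap_in_days` shows the release came well after the term ended.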