Advanced RAG 08: Exploring Self-RAG

静愚 AGI

Published 2024-08-08 08:00:00

Article link: https://blog.csdn.net/JingYu_365/article/details/140938485

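Summary

The article opens with an open-book exam analogy and describes two strategies:

Answer familiar questions directly and consult the reference book only for unfamiliar ones;
Consult the reference book for every question.

The author argues that the first strategy is more efficient, while the second tends to cause information overload and errors.

The article then compares the main workflows of RAG and Self-RAG. Self-RAG improves generation quality, factuality, and verifiability through three steps: on-demand retrieval, parallel generation, and evaluation and selection.

The article describes the three steps in detail: on-demand retrieval (deciding whether external information needs to be retrieved), parallel generation (generating output from both the prompt and the retrieved content, with reflection tokens indicating how relevant the retrieved content is), and evaluation and selection (evaluating the generated content and choosing the best segment as the output).

The author also notes that Self-RAG uses four types of reflection tokens for finer control: **[Retrieve], [IsREL], [IsSUP], and [IsUSE]**, each corresponding to a different judgment or evaluation.

The article further walks through a code implementation, covering environment setup, test code, and how the Llama2-7B model is trained. By analyzing the test code, readers can follow how Self-RAG is implemented and how to run queries with the SelfRAGQueryEngine class from the LlamaIndex library.

Although Self-RAG improves the quality and accuracy of generation, it also increases the complexity and cost of training and inference. The author suggests several directions for optimization and discusses how Self-RAG relates to other techniques such as RLHF.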

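Key points

Advantages of Self-RAG: By introducing reflection tokens, Self-RAG controls text generation more flexibly and improves generation quality, factuality, and verifiability.
Role of reflection tokens: Reflection tokens are the core of Self-RAG. They let the model make more precise judgments and evaluations during generation.
Training complexity: Training Self-RAG is more complex than training a traditional RAG system. It requires a critic model and a generator model working together, and multiple tokens must be generated and judged during generation.
Higher inference cost: Because of this complexity, Self-RAG is more expensive at inference time, which may affect projects with real-time performance requirements.
Room for optimization and innovation: The Self-RAG framework leaves considerable room for optimization, including reducing the number of reflection tokens, choosing a critic model of suitable size, and exploring combinations with other techniques such as RLHF.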

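This article starts with a common scenario: taking an open-book exam.

We usually have two strategies:

Method 1: For familiar questions, answer quickly. For unfamiliar ones, open the reference book, quickly locate the relevant part, sort and summarize it mentally, and then write the answer on the exam paper.
Method 2: Consult the reference book for every question. Find the relevant sections, sort and summarize them mentally, and then write the answer on the exam paper.

Clearly, Method 1 is preferable. Method 2 can be time-consuming and may introduce irrelevant or incorrect information, which can cause confusion and mistakes even in areas you originally understood.

However, Method 2 corresponds to the classic RAG process, while Method 1 represents the Self-RAG process, which this article discusses further.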

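Overview

Figure 1 compares the main workflows of RAG and Self-RAG:

[Figure 1: Main workflows of RAG and Self-RAG]

Self-RAG consists of three steps:

Retrieve on demand: When the model needs retrieval, for example for the query "How did US states get their names?" (top right of Figure 1), its output contains a [Retrieve] token, indicating that content related to the query needs to be retrieved. In contrast, when asked to "Write an essay of your best summer vacation" (bottom right of Figure 1), the model generates the answer directly without retrieval.
Generate in parallel: The model uses both the prompt and the retrieved content to generate output. Throughout this process, three types of reflection tokens indicate how relevant the retrieved content is.
Evaluate and select: The content generated in step 2 is evaluated, and the best segment is selected as the output.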
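The three steps above can be sketched in a few lines of Python. This is a toy illustration of the control flow only, not the paper's or LlamaIndex's implementation: `generate`, `retrieve`, and `score` are hypothetical callables standing in for the trained model, the retriever, and the reflection-token-based scoring.

```python
from typing import Callable, List


def self_rag_answer(
    query: str,
    generate: Callable[[str], str],        # hypothetical: prompt -> model output
    retrieve: Callable[[str], List[str]],  # hypothetical: query -> passages
    score: Callable[[str], float],         # hypothetical: candidate -> score
) -> str:
    """Toy Self-RAG control flow: retrieve on demand, generate, select."""
    first_pass = generate(query)

    # Step 1: retrieve only if the model asked for it.
    if "[Retrieval]" not in first_pass:
        return first_pass

    # Step 2: one candidate answer per retrieved passage
    # (conceptually parallel; sequential here for simplicity).
    candidates = [
        generate(f"{query}\n[Retrieval]<paragraph>{passage}</paragraph>")
        for passage in retrieve(query)
    ]
    if not candidates:
        return first_pass

    # Step 3: keep the candidate with the highest score.
    return max(candidates, key=score)
```

Passing the three components in as callables keeps the sketch independent of any particular model or retriever.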

Note that the model above is a specially trained model. Its training process is discussed later in this article.

Reflection Tokens

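As shown in Figure 2, the Self-RAG framework differs from RAG in that it uses reflection tokens during generation for more precise control.

[Figure 2: Reflection tokens in the Self-RAG framework]

In essence, Self-RAG makes four distinct judgments:

[Retrieve]: Decides whether to retrieve information from the source R.
[IsREL]: A relevance check that determines whether a given passage d contains the information needed to solve the problem x.
[IsSUP]: A verification step that checks whether the passage d supports the statements in the generated response y.
[IsUSE]: An assessment of how useful the response is, output as a score from 1 to 5, where 5 means most useful.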
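To make these judgments concrete, here is a minimal sketch that pulls reflection tokens out of a model prediction. The surface forms `[Relevant]`, `[Fully supported]`, `[No support / Contradictory]`, and `[Utility:5]` appear in the test output quoted later in this article; `[Irrelevant]` and `[Partially supported]` are assumed counterparts. The names `Reflection` and `parse_reflection` are made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Token surface forms; entries marked "assumed" do not appear in this
# article's logs and are included as likely counterparts.
IS_REL = ("[Relevant]", "[Irrelevant]")  # "[Irrelevant]" is assumed
IS_SUP = (
    "[Fully supported]",
    "[Partially supported]",  # assumed
    "[No support / Contradictory]",
)
UTILITY = re.compile(r"\[Utility:([1-5])\]")


@dataclass
class Reflection:
    is_rel: Optional[str]
    is_sup: Optional[str]
    is_use: Optional[int]
    text: str  # the prediction with reflection tokens stripped


def parse_reflection(prediction: str) -> Reflection:
    """Split a model prediction into reflection tokens and plain text."""
    is_rel = next((t for t in IS_REL if t in prediction), None)
    is_sup = next((t for t in IS_SUP if t in prediction), None)
    match = UTILITY.search(prediction)
    is_use = int(match.group(1)) if match else None

    text = prediction
    for token in IS_REL + IS_SUP:
        text = text.replace(token, "")
    text = UTILITY.sub("", text).strip()
    return Reflection(is_rel, is_sup, is_use, text)
```

Applied to a prediction such as `[Relevant]...[No support / Contradictory][Utility:5]`, the function returns the three judgments separately from the answer text.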

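In RAG, retrieval is a fixed step that is always performed at the start, regardless of conditions. In contrast, Self-RAG introduces reflection tokens, which make the LLM more adaptive. When the LLM is generating text and reaches an area of uncertainty, it pauses at a reflection token, performs a quick and precise retrieval, and then continues generating with the newly acquired information.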

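Code explanation

To understand the Self-RAG process intuitively, we will first examine the code and then discuss how the model is trained.

Self-RAG is open source, and both LangChain and LlamaIndex have their own implementations. We will use the LlamaIndex implementation as the reference.

Environment configuration

First, configure the environment.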

(base) Florian@instance-1:~$ conda create -n llamaindex python=3.11
(base) Florian@instance-1:~$ conda activate llamaindex

(llamaindex) Florian@instance-1:~$ pip install llama-index
(llamaindex) Florian@instance-1:~$ pip install huggingface-hub

(llamaindex) Florian@instance-1:~$ huggingface-cli login

After installation, the LlamaIndex versions are as follows:

llama-index                             0.10.20
llama-index-core                        0.10.20.post2

Download the Llama2-7B model provided by the paper, about 4.08 GB. You can also download it from here.

(llamaindex) Florian@instance-1:~$ huggingface-cli download m4r1/selfrag_llama2_7b-GGUF selfrag_llama2_7b.q4_k_m.gguf --local-dir "YOUR_DOWNLOAD_MODEL_DIR" --local-dir-use-symlinks False

(llamaindex) Florian@instance-1:~$ ls "YOUR_DOWNLOAD_MODEL_DIR"
selfrag_llama2_7b.q4_k_m.gguf

Test code

The test code is as follows. The first run needs to download SelfRAGPack.

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.readers import SimpleDirectoryReader
from pathlib import Path


# Option: download SelfRAGPack
# The first execution requires the download of SelfRAGPack.
# Subsequent executions can comment this out.
from llama_index.core.llama_pack import download_llama_pack
download_llama_pack(
    "SelfRAGPack",
    "./self_rag_pack")

from llama_index.packs.self_rag import SelfRAGQueryEngine

# The directory where the Llama2 model was previously downloaded and saved.
download_dir = "YOUR_DOWNLOAD_MODEL_DIR"

# Create testing documents
documents = [
    Document(
        text="A group of penguins, known as a 'waddle' on land, shuffled across the Antarctic ice, their tuxedo-like plumage standing out against the snow."
    ),
    Document(
        text="Emperor penguins, the tallest of all penguin species, can dive deeper than any other bird, reaching depths of over 500 meters."
    ),
    Document(
        text="Penguins' black and white coloring is a form of camouflage called countershading; from above, their black back blends with the ocean depths, and from below, their white belly matches the bright surface."
    ),
    Document(
        text="Despite their upright stance, penguins are birds that cannot fly; their wings have evolved into flippers, making them expert swimmers."
    ),
    Document(
        text="The fastest species, the Gentoo penguin, can swim up to 36 kilometers per hour, using their flippers and streamlined bodies to slice through the water."
    ),
    Document(
        text="Penguins are social birds; many species form large colonies for breeding, which can number in the tens of thousands."
    ),
    Document(
        text="Intriguingly, penguins have excellent hearing and rely on distinct calls to identify their mates and chicks amidst the noisy colonies."
    ),
    Document(
        text="The smallest penguin species, the Little Blue Penguin, stands just about 40 cm tall and is found along the coastlines of southern Australia and New Zealand."
    ),
    Document(
        text="During the breeding season, male Emperor penguins endure the harsh Antarctic winter for months, fasting and incubating their eggs, while females hunt at sea."
    ),
    Document(
        text="Penguins consume a variety of seafood; their diet mainly consists of fish, squid, and krill, which they catch on their diving expeditions."
    ),
]

index = VectorStoreIndex.from_documents(documents)

# Setup a simple retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)


model_path = Path(download_dir) / "selfrag_llama2_7b.q4_k_m.gguf"
query_engine = SelfRAGQueryEngine(str(model_path), retriever, verbose=True)

# No retrieval example
response = query_engine.query("Which genre the book pride and prejudice?")

# Retrieval example
response = query_engine.query("How tall is the smallest penguins?")

The test code produced the following results (most of the llama_cpp debugging output has been removed):

...
...
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '11008', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '32', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
Using fallback chat format: None

llama_print_timings:        load time =    4887.53 ms
llama_print_timings:      sample time =      11.29 ms /    22 runs   (    0.51 ms per token,  1947.76 tokens per second)
llama_print_timings: prompt eval time =    4887.46 ms /    24 tokens (  203.64 ms per token,     4.91 tokens per second)
llama_print_timings:        eval time =    5883.27 ms /    21 runs   (  280.16 ms per token,     3.57 tokens per second)
llama_print_timings:       total time =   10901.84 ms /    45 tokens
Final answer: The book "Pride and Prejudice" is a romantic novel by Jane Austen.
...
...
llama_print_timings:        load time =    4887.53 ms
llama_print_timings:      sample time =      11.74 ms /    20 runs   (    0.59 ms per token,  1703.29 tokens per second)
llama_print_timings: prompt eval time =    7473.66 ms /    37 tokens (  201.99 ms per token,     4.95 tokens per second)
llama_print_timings:        eval time =    5414.34 ms /    19 runs   (  284.96 ms per token,     3.51 tokens per second)
llama_print_timings:       total time =   13076.88 ms /    56 tokens
Input: ### Instruction:
How tall is the smallest penguins?

### Response:
[Retrieval]<paragraph>Penguins consume a variety of seafood; their diet mainly consists of fish, squid, and krill, which they catch on their diving expeditions.</paragraph>
Prediction: [Relevant]The height of the smallest penguin species can vary depending on the species.[No support / Contradictory][Utility:5]
Score: 1.4213598342974367
10/10 paragraphs done

End evaluation
Selected the best answer: [Relevant]The smallest penguin species is the Little Blue Penguin (also known as the Fairy Penguin), which can grow to be around 40 centimeters (16 inches) in height.[Fully supported][Utility:5]
Final answer: The smallest penguin species is the Little Blue Penguin (also known as the Fairy Penguin), which can grow to be around 40 centimeters (16 inches) in height.

We can see that the first query required no retrieval, while the second query went through retrieval and evaluation.
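The `Input:` lines in the log also reveal the prompt layout that the model expects. A minimal sketch of a helper that reproduces this layout might look as follows; `format_prompt` is a name chosen for this example and is modeled only on the log, so the pack's own prompt helper may differ in details.

```python
from typing import Optional


def format_prompt(query: str, paragraph: Optional[str] = None) -> str:
    """Build a prompt in the layout shown in the log above (illustrative)."""
    prompt = f"### Instruction:\n{query}\n\n### Response:\n"
    if paragraph is not None:
        # A retrieved passage is injected after a [Retrieval] token.
        prompt += f"[Retrieval]<paragraph>{paragraph}</paragraph>"
    return prompt
```

Without a paragraph, the model is free to answer directly or to emit `[Retrieval]` itself. With a paragraph, generation continues from the injected passage.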

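The key to understanding the test code is the implementation of class SelfRAGQueryEngine, so let's take a closer look at this class.

Class SelfRAGQueryEngine

First comes the constructor, which mainly loads the Llama2-7B model using llama_cpp.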

class SelfRAGQueryEngine(CustomQueryEngine):
    """Simple short form self RAG query engine."""

    llm: Any = Field(default=None, description="llm")
    retriever: BaseRetriever = Field(default=None, description="retriever")
    generate_kwargs: Dict = Field(default=None, description="llm generation arguments")
    verbose: bool = Field(default=True, description="Verbose.")

    def __init__(
        self,
        model_path: str,
        retriever: BaseRetriever,
        verbose: bool = False,
        model_kwargs: Dict = None,
        generate_kwargs: Dict = None,
        **kwargs: Any,
    ) -> None:
        """Init params."""
        super().__init__(verbose=verbose, **kwargs)
        model_kwargs = model_kwargs or _MODEL_KWARGS
        self.generate_kwargs = generate_kwargs or _GENERATE_KWARGS
        try:
            from llama_cpp import Llama
        except ImportError:
            raise ImportError(_IMPORT_ERROR_MSG)
        self.llm = Llama(model_path=model_path, verbose=verbose, **model_kwargs)
        self.retriever = retriever

Next is the query function. Its main flow is shown in Figure 3:

[Figure 3: Main flow of the query function]

Key parts have been commented for easier understanding.

    def custom_query(self, query_str: str) -> Response:
        """Run self-RAG."""
        # Obtain responses using the Llama2 model.
        response = self.llm(prompt=_format_prompt(query_str), **_GENERATE_KWARGS)
        answer = response["choices"][0]["text"]
        source_nodes = []

        # Determine if a retrieval is necessary.
        if "[Retrieval]" in answer:
            if self.verbose:
                print_text("Retrieval required\n", color="blue")
            # The step 1 of Figure 1, retrieve as needed.
            documents = self.retriever.retrieve(query_str)
            if self.verbose:
                print_text(f"Received: {len(documents)} documents\n", color="blue")
            paragraphs = [
                _format_prompt(query_str, document.node.text) for document in documents
            ]

            if self.verbose:
                print_text("Start evaluation\n", color="blue")

            # Step 2 and 3 in Figure 1, generate in parallel and evaluate 
            # (the code does not implement parallelism)
            critic_output = self._run_critic(paragraphs)

            paragraphs_final_score = critic_output.paragraphs_final_score
            llm_response_per_paragraph = critic_output.llm_response_per_paragraph
            source_nodes = critic_output.source_nodes

            if self.verbose:
                print_text("End evaluation\n", color="blue")

            # Select the paragraph with the highest score and return it.
            best_paragraph_id = max(
                paragraphs_final_score, key=paragraphs_final_score.get
            )
            answer = llm_response_per_paragraph[best_paragraph_id]
            if self.verbose:
                print_text(f"Selected the best answer: {answer}\n", color="blue")

        answer = _postprocess_answer(answer)
        if self.verbose:
            print_text(f"Final answer: {answer}\n", color="green")
        return Response(response=str(answer), source_nodes=source_nodes)

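From the code we can see that all three steps in Figure 1 are represented. However, LlamaIndex's code does not implement parallelism. Interested readers can look at the self._run_critic function for more details. It also handles the scores corresponding to the various reflection tokens.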
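To give a feel for how reflection tokens can become a single number, here is a simplified sketch of a weighted-sum score. It is not the code of `_run_critic`. The function name, the probability dictionaries it takes as input, the linear mapping of utility ratings, and the default weights are all assumptions made for illustration.

```python
from typing import Dict


def segment_score(
    p_rel: Dict[str, float],
    p_sup: Dict[str, float],
    p_use: Dict[int, float],
    w_rel: float = 1.0,  # assumed weight
    w_sup: float = 1.0,  # assumed weight
    w_use: float = 0.5,  # assumed weight
) -> float:
    """Combine reflection-token probabilities into one score (assumed scheme)."""
    # Relevance: share of probability mass on "[Relevant]".
    rel = p_rel["[Relevant]"] / sum(p_rel.values())

    # Support: full support counts 1, partial support counts 0.5.
    sup = (
        p_sup["[Fully supported]"] + 0.5 * p_sup["[Partially supported]"]
    ) / sum(p_sup.values())

    # Utility: ratings 1..5 mapped linearly onto [-1, 1], then averaged.
    use = sum((rating - 3) / 2 * p for rating, p in p_use.items()) / sum(
        p_use.values()
    )
    return w_rel * rel + w_sup * sup + w_use * use
```

Under this scheme, a candidate that is relevant, fully supported, and rated most useful scores highest, which matches the selection behavior seen in the log.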

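How to train the Llama2-7B model

We have used the Llama2-7B model several times, so let's look at how it is obtained.

Training objective

Enable the language model to generate text that contains reflection tokens.

Two models

Training requires two models: the critic model C and the generator model M.

During inference, however, only model M is used. Model C is not needed.

Critic model C

The critic model is trained to generate reflection tokens. Its purpose is to insert reflection tokens into task outputs offline, thereby updating the training corpus.

Manually annotating reflection tokens for every segment is expensive. Because each reflection token has its own definition, input, and output, Self-RAG uses GPT-4 with a dedicated instruction for each reflection token to complete the annotation efficiently. For example, the instruction for the [Retrieve] token prompts GPT-4 to assess whether bringing in external documents would improve the result.

After obtaining the training data D_critic, we can construct the training objective based on a standard conditional language modeling objective, as follows:

[Equation image: training objective of the critic model C]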
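Written out in LaTeX, a standard conditional language modeling objective for the critic would look like the following. This is a reconstruction rather than a copy of the original figure: r denotes a reflection token and p_C the distribution predicted by the critic, and both symbols are assumptions based on the surrounding definitions.

```latex
\max_{C} \; \mathbb{E}_{((x, y), r) \sim D_{critic}} \log p_{C}(r \mid x, y)
```

In words, the critic is trained to assign high probability to the annotated reflection token r given the input x and the output y.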

The critic model C can be initialized with any language model. For example, it can be initialized with the same model as the generator, such as Llama2-7B.

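Generator model M

Figure 4 shows the specific process of collecting training data. Given an input-output pair (x, y), Self-RAG augments the original output y using the retrieval model and the critic model to create supervised data. For each segment y_t ∈ y:

[Figure 4: Process of collecting training data for the generator]

Note that every conditional judgment in Figure 4 is performed by the critic model C. The resulting training data is shown in Figure 5:

[Figure 5: Example of the resulting training data]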
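The data collection loop can be sketched roughly as follows. This is a simplified toy version, not the paper's pipeline: `critic` and `retrieve` are hypothetical callables, only the first relevant passage is kept for each segment, and the `[No Retrieval]` token is an assumed surface form.

```python
from typing import Callable, List


def augment_output(
    x: str,
    segments: List[str],
    critic: Callable[[str, str], str],     # hypothetical: (judgment, context) -> token
    retrieve: Callable[[str], List[str]],  # hypothetical: query -> passages
) -> str:
    """Interleave reflection tokens and passages into an output y (toy version)."""
    pieces: List[str] = []
    for y_t in segments:
        context = f"{x} {y_t}"
        if critic("Retrieve", context) != "[Retrieval]":
            pieces.append(f"[No Retrieval]{y_t}")  # assumed token form
            continue
        for d in retrieve(context):
            is_rel = critic("IsREL", f"{x} {d}")
            if is_rel != "[Relevant]":
                continue  # skip passages the critic judges irrelevant
            is_sup = critic("IsSUP", f"{d} {y_t}")
            pieces.append(
                f"[Retrieval]<paragraph>{d}</paragraph>{is_rel}{y_t}{is_sup}"
            )
            break  # simplification: keep only the first relevant passage
        else:
            pieces.append(y_t)  # no relevant passage was found
    # The usefulness judgment is made once, for the whole output.
    pieces.append(critic("IsUSE", f"{x} {' '.join(segments)}"))
    return "".join(pieces)
```

The returned string mixes answer text, retrieved passages, and reflection tokens, which is the kind of sequence the generator is then trained on.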

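After obtaining the training data D_gen, we can construct the standard next-token prediction objective as follows:

[Equation image: training objective of the generator model M]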
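In LaTeX, a next-token prediction objective over the augmented data could be written as below. As with the critic objective, this is a reconstruction under assumptions: r stands for the reflection tokens inserted into the output and p_M for the generator's predicted distribution.

```latex
\max_{M} \; \mathbb{E}_{(x, y, r) \sim D_{gen}} \log p_{M}(y, r \mid x)
```

The joint term (y, r) reflects that the generator learns to produce both the answer text and the reflection tokens from the input x.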

The generator M needs to predict not only the output but also the reflection tokens.

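Insights and thoughts on Self-RAG

In general, Self-RAG offers a new perspective on enhancing the RAG process. However, it requires a more complex training process, and generating and judging multiple reflection tokens during generation inevitably increases inference cost. This may seriously affect projects that require real-time performance.

In addition, the framework leaves considerable room for optimization. To prompt further discussion and innovation, here are a few suggestions:

**How to optimize reflection tokens.** Self-RAG defines four reflection tokens. Apart from the [Retrieve] token, the other three ([IsREL], [IsSUP], [IsUSE]) are somewhat similar. Using fewer reflection tokens, or reflection tokens that carry other semantics, is a viable direction.
**Why does the critic model use an LLM?** I think this may be because tokens like [IsUSE] rely heavily on common sense. Judging the usefulness of an answer to a query is a task that smaller models might also handle. However, such models usually learn only from specific training data and lack comprehensive knowledge. Using an LLM as the critic model therefore makes sense.
**Choice of critic model size.** Self-RAG has been tested with 7B and 13B models, with excellent results. If we switch to a smaller LLM, such as 3B, what differences would we observe? Likewise, if we move to a larger LLM, such as 33B, how much improvement could we expect?
**Why not use reinforcement learning from human feedback (RLHF)?** The paper proposes training the target language model on task examples. These examples are augmented offline with reflection tokens from the critic model, so the training cost is much lower than with RLHF. In addition, the reflection tokens in Self-RAG make generation controllable at inference time, whereas RLHF focuses on aligning with human preferences during training. However, the paper does not include any comparison experiments with RLHF.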