Advanced RAG 12: Enhancing Global Understanding in RAG (Part 1)

Article link: https://blog.csdn.net/JingYu_365/article/details/141112680

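Summary

This article introduces four methods for improving RAG (Retrieval-Augmented Generation) so that it gains a more comprehensive understanding of a document or corpus, and describes the theoretical basis, implementation, and experimental results of each.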

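Abstract

This article describes four RAG methods that strengthen global understanding: RAPTOR, Graph RAG, HippoRAG, and spRAG.

RAPTOR organizes text chunks into a tree, clustering and summarizing them into multi-level summaries that make information easier to retrieve.
Graph RAG builds a graph-based text index using a knowledge graph and community detection, and uses community summaries at query time to make sense of the whole corpus.
HippoRAG, inspired by human memory, combines LLMs (large language models), a knowledge graph, and the Personalized PageRank algorithm to mimic the pattern separation and pattern completion functions of human memory.
spRAG improves on standard RAG systems through automatic context injection and relevant segment extraction.

The article also compares these methods in terms of data structures, retrieval algorithms, performance, and customizability.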

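Many important real-world tasks, including scientific literature reviews, legal case briefs, and medical diagnosis, require understanding knowledge that spans chunks or documents.

Existing RAG methods offer little help to LLMs on tasks that require understanding information across chunk boundaries, because each chunk is encoded independently.

This article introduces four innovative methods for enhancing comprehensive understanding of a document or corpus, along with the insights and reflections they prompt.

The four methods are:

RAPTOR: a tree-based retrieval system that recursively embeds, clusters, and summarizes text chunks.
Graph RAG: combines knowledge graph generation, community detection, RAG, and query-focused summarization (QFS) to support a comprehensive understanding of an entire text corpus.
HippoRAG: a retrieval framework inspired by the hippocampal indexing theory of human long-term memory. It combines an LLM, a knowledge graph, and the Personalized PageRank algorithm.
spRAG: improves the performance of standard RAG systems through two key techniques, AutoContext and Relevant Segment Extraction (RSE).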

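RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

RAPTOR is a novel tree-based retrieval system designed to recursively embed, cluster, and summarize text chunks. It builds a tree from the bottom up, providing summaries at different levels of abstraction.

At inference time, RAPTOR retrieves from this tree, integrating information from long documents at different levels of abstraction.

Key Idea

RAPTOR recursively groups text chunks into clusters based on their embeddings and generates a summary for each cluster, building a tree from the bottom up. Figure 1 shows this process.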

[Figure 1: Building the RAPTOR tree]

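The following subsections cover the two topics related to Figure 1:

Building the RAPTOR tree
The retrieval process

Building the RAPTOR Tree

Text Chunking

Split the retrieval corpus into contiguous chunks of 100 tokens each. If a sentence would push a chunk past 100 tokens, RAPTOR moves the whole sentence to the next chunk instead of cutting it in the middle, which preserves contextual and semantic coherence.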

from typing import List

import tiktoken


def split_text(
    text: str, tokenizer: tiktoken.Encoding, max_tokens: int, overlap: int = 0
) -> List[str]:
    """
    Splits the input text into smaller chunks based on the tokenizer and maximum allowed tokens.
    
    Args:
        text (str): The text to be split.
        tokenizer (CustomTokenizer): The tokenizer to be used for splitting the text.
        max_tokens (int): The maximum allowed tokens.
        overlap (int, optional): The number of overlapping tokens between chunks. Defaults to 0.
    
    Returns:
        List[str]: A list of text chunks.
    """
    ...
    ...        
        # If adding the sentence to the current chunk exceeds the max tokens, start a new chunk
        elif current_length + token_count > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = current_chunk[-overlap:] if overlap > 0 else []
            current_length = sum(n_tokens[max(0, len(current_chunk) - overlap):len(current_chunk)])
            current_chunk.append(sentence)
            current_length += token_count
    ...
    ...
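
Because the excerpt above is elided, the rule it implements can be hard to see. The following is a minimal, self-contained sketch of the same idea, not the RAPTOR implementation. It assumes whitespace-separated words as a stand-in for tiktoken tokens and a naive regex sentence splitter, and `chunk_by_sentence` is a hypothetical name.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_tokens: int = 100) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    Assumption: a "token" is a whitespace-separated word. RAPTOR itself
    counts tokens with a tiktoken encoding.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current: List[str] = []
    current_length = 0

    for sentence in sentences:
        token_count = len(sentence.split())
        # If the sentence does not fit, close the current chunk and move
        # the whole sentence to the next one.
        if current and current_length + token_count > max_tokens:
            chunks.append(" ".join(current))
            current, current_length = [], 0
        current.append(sentence)
        current_length += token_count

    if current:
        chunks.append(" ".join(current))
    return chunks
```

In this sketch a single sentence longer than `max_tokens` ends up as its own oversized chunk; splitting such a sentence further is left out for brevity.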

Embedding

Use Sentence-BERT to generate dense vector representations of these chunks.

The chunks and their embeddings form the leaf nodes of the RAPTOR tree.

class TreeBuilder:
    """
    The TreeBuilder class is responsible for building a hierarchical text abstraction
    structure, known as a "tree," using summarization models and
    embedding models.
    """
    ...
    ...
    def build_from_text(self, text: str, use_multithreading: bool = True) -> Tree:
        """Builds a golden tree from the input text, optionally using multithreading.

        Args:
            text (str): The input text.
            use_multithreading (bool, optional): Whether to use multithreading when creating leaf nodes.
                Default: True.

        Returns:
            Tree: The golden tree structure.
        """
        chunks = split_text(text, self.tokenizer, self.max_tokens)

        logging.info("Creating Leaf Nodes")

        if use_multithreading:
            leaf_nodes = self.multithreaded_create_leaf_nodes(chunks)
        else:
            leaf_nodes = {}
            for index, text in enumerate(chunks):
                __, node = self.create_node(index, text)
                leaf_nodes[index] = node

        layer_to_nodes = {0: list(leaf_nodes.values())}

        logging.info(f"Created {len(leaf_nodes)} Leaf Embeddings")
        ...
        ...
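
To make the leaf-node step concrete, here is a small sketch of how chunks and their embeddings could be paired into leaf nodes. `LeafNode`, `build_leaf_nodes`, and `toy_embed` are hypothetical names introduced for illustration; `toy_embed` is only a stand-in for a real embedding model such as Sentence-BERT.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class LeafNode:
    index: int
    text: str
    embedding: List[float]
    # Leaf nodes have no children; parents created later would list child indices.
    children: Set[int] = field(default_factory=set)


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Stand-in embedder (assumption): character counts folded into dim buckets."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def build_leaf_nodes(
    chunks: List[str], embed: Callable[[str], List[float]]
) -> Dict[int, LeafNode]:
    """Create one leaf node per chunk, keyed by the chunk's position."""
    return {i: LeafNode(i, chunk, embed(chunk)) for i, chunk in enumerate(chunks)}
```

A real pipeline would pass an actual sentence-embedding function as `embed`; the structure of the resulting dictionary stays the same.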

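Clustering Method

Clustering is essential to building the RAPTOR tree because it organizes text segments into coherent groups. Grouping related content together improves the subsequent retrieval process.

RAPTOR's clustering method has the following characteristics:

It performs soft clustering with Gaussian Mixture Models (GMMs) and UMAP dimensionality reduction.
The UMAP parameters can be varied to identify both global and local clusters.
The Bayesian Information Criterion (BIC) is used for model selection to determine the optimal number of clusters.

The core of this approach is that a node can belong to multiple clusters. A text segment often contains information on several topics, so soft clustering lets it be included in multiple summaries, and no fixed number of clusters has to be set in advance.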
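
The BIC-based model selection and soft assignment described above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not RAPTOR's code: the UMAP reduction step is omitted, the function names `pick_n_clusters` and `soft_cluster` are hypothetical, and the 0.1 membership threshold is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def pick_n_clusters(X: np.ndarray, max_clusters: int = 10, seed: int = 0) -> int:
    """Return the number of GMM components with the lowest BIC."""
    candidates = range(1, min(max_clusters, len(X)) + 1)
    bics = [
        GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
        for k in candidates
    ]
    return int(candidates[int(np.argmin(bics))])


def soft_cluster(
    X: np.ndarray, threshold: float = 0.1, max_clusters: int = 10, seed: int = 0
):
    """Assign each row of X to every cluster whose posterior exceeds threshold.

    Returns (labels, k), where labels[i] is the list of cluster ids for row i.
    A row can therefore appear in more than one cluster.
    """
    k = pick_n_clusters(X, max_clusters=max_clusters, seed=seed)
    gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
    probs = gmm.predict_proba(X)  # shape: (n_samples, k)
    labels = [np.where(row > threshold)[0].tolist() for row in probs]
    return labels, k
```

Thresholding the posterior probabilities, rather than taking the single most likely component, is what allows a chunk that mixes topics to feed into several cluster summaries.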

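After the nodes are clustered with GMMs, the nodes in each cluster are summarized by an LLM. This step condenses large chunks of text into a concise, coherent summary of the selected nodes.

The implementation uses gpt-3.5-turbo to generate the summaries. Figure 2 shows the corresponding prompt.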

[Figure 2: Prompt used for summarization]

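Construction Algorithm

At this point we have the leaf nodes of the whole tree and have settled on the clustering algorithm.

As the middle of Figure 1 shows, nodes grouped into the same cluster become siblings, and their parent node holds the summary of that cluster. The generated summaries make up the non-leaf nodes of the tree.

The summary nodes are re-embedded, and the cycle of embedding, clustering, and summarizing continues until further clustering is no longer feasible. The result is a structured, multi-layer tree representation of the original document.

The corresponding code is shown below.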

class ClusterTreeBuilder(TreeBuilder):
    ...
    ...
    def construct_tree(
        self,
        current_level_nodes: Dict[int, Node],
        all_tree_nodes: Dict[int, Node],
        layer_to_nodes: Dict[int, List[Node]],
        use_multithreading: bool = False,
    ) -> Dict[int, Node]:
        ...
        ...

        for layer in range(self.num_layers):

            new_level_nodes = {}

            logging.info(f"Constructing Layer {layer}")

            node_list_current_layer = get_node_list(current_level_nodes)

            if len(node_list_current_layer) <= self.reduction_dimension + 1:
                self.num_layers = layer
                logging.info(
                    f"Stopping Layer construction: Cannot Create More Layers. Total Layers in tree: {layer}"
                )
                break

            clusters = self.clustering_algorithm.perform_clustering(
                node_list_current_layer,
                self.cluster_embedding_model,
                reduction_dimension=self.reduction_dimension,
                **self.clustering_params,
            )

            lock = Lock()

            summarization_length = self.summarization_length
            logging.info(f"Summarization Length: {summarization_length}")

            ...
            ...
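
The bottom-up loop can also be shown in a runnable, simplified form. The sketch below is an assumption-laden illustration rather than the library's code: clustering and summarization are passed in as callables, embedding is assumed to happen inside the `cluster` callable, and `build_layers`, `pair_cluster`, and `join_summarize` are hypothetical names. The two helpers are toy stand-ins for GMM clustering and LLM summarization.

```python
from typing import Callable, Dict, List

Clusterer = Callable[[List[str]], List[List[int]]]
Summarizer = Callable[[List[str]], str]


def build_layers(
    leaf_texts: List[str],
    cluster: Clusterer,
    summarize: Summarizer,
    max_layers: int = 5,
    min_nodes: int = 2,
) -> Dict[int, List[str]]:
    """Return {layer: node texts}, with layer 0 holding the leaves."""
    layers = {0: list(leaf_texts)}
    current = layers[0]
    for layer in range(1, max_layers + 1):
        # Stop when too few nodes remain for clustering to be meaningful.
        if len(current) <= min_nodes:
            break
        groups = cluster(current)
        # One parent per cluster, holding the summary of its members.
        parents = [summarize([current[i] for i in group]) for group in groups]
        # Guard against a clusterer that fails to shrink the layer.
        if len(parents) >= len(current):
            break
        layers[layer] = parents
        current = parents
    return layers


def pair_cluster(texts: List[str]) -> List[List[int]]:
    """Toy clusterer (assumption): group adjacent nodes in pairs."""
    return [list(range(i, min(i + 2, len(texts)))) for i in range(0, len(texts), 2)]


def join_summarize(texts: List[str]) -> str:
    """Toy summarizer (assumption): concatenate the member texts."""
    return "(" + " + ".join(texts) + ")"
```

Calling `build_layers(["a", "b", "c", "d"], pair_cluster, join_summarize)` yields two layers: the four leaves and two parent summaries. The loop then stops because only two nodes remain, mirroring the stopping check in the excerpt above.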

Retrieval Process

Once the RAPTOR tree is built, how is it queried?

There are two query strategies, tree traversal and collapsed tree, as shown in Figure 3.

[Figure 3: Tree traversal and collapsed tree retrieval]

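Tree traversal starts at the root layer of the tree and retrieves the top-k nodes (top-1 in this example) by cosine similarity to the query vector. At each subsequent layer, it retrieves the top-k nodes from among the children of the previous layer's top-k nodes. The corresponding code is shown below.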

class TreeRetriever(BaseRetriever):
    ...
    ...
    def retrieve_information(
        self, current_nodes: List[Node], query: str, num_layers: int
    ) -> Tuple[List[Node], str]:
        """
        Retrieves the most relevant information from the tree based on the query.

        Args:
            current_nodes (List[Node]): A List of the current nodes.
            query (str): The query text.
            num_layers (int): The number of layers to traverse.

        Returns:
            Tuple[List[Node], str]: The selected nodes and the context built from their text.
        """

        query_embedding = self.create_embedding(query)

        selected_nodes = []

        node_list = current_nodes

        for layer in range(num_layers):

            embeddings = get_embeddings(node_list, self.context_embedding_model)

            distances = distances_from_embeddings(query_embedding, embeddings)

            indices = indices_of_nearest_neighbors_from_distances(distances)

            if self.selection_mode == "threshold":
                best_indices = [
                    index for index in indices if distances[index] > self.threshold
                ]

            elif self.selection_mode == "top_k":
                best_indices = indices[: self.top_k]

            nodes_to_add = [node_list[idx] for idx in best_indices]

            selected_nodes.extend(nodes_to_add)

            if layer != num_layers - 1:

                child_nodes = []

                for index in best_indices:
                    child_nodes.extend(node_list[index].children)

                # take the unique values
                child_nodes = list(dict.fromkeys(child_nodes))
                node_list = [self.tree.all_nodes[i] for i in child_nodes]

        context = get_text(selected_nodes)
        return selected_nodes, context
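
The method above depends on helpers that are not shown. As a self-contained illustration of tree traversal, the sketch below walks a small in-memory tree using plain cosine similarity. The `Node` dataclass, `cosine_similarity`, and `traverse` are hypothetical names for this example, and unlike the excerpt the sketch descends until it runs out of children instead of taking a `num_layers` argument.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Node:
    index: int
    text: str
    embedding: List[float]
    children: List[int] = field(default_factory=list)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def traverse(
    all_nodes: Dict[int, Node],
    root_ids: List[int],
    query_emb: Sequence[float],
    top_k: int = 1,
) -> List[Node]:
    """Select the top_k nodes per layer, descending through their children."""
    selected: List[Node] = []
    frontier = [all_nodes[i] for i in root_ids]
    while frontier:
        ranked = sorted(
            frontier,
            key=lambda n: cosine_similarity(query_emb, n.embedding),
            reverse=True,
        )
        best = ranked[:top_k]
        selected.extend(best)
        # Next layer: the de-duplicated children of the nodes just selected.
        child_ids = list(dict.fromkeys(c for n in best for c in n.children))
        frontier = [all_nodes[i] for i in child_ids]
    return selected
```

The returned list contains one group of nodes per layer, from the most abstract summary down to the leaves, so the final context mixes several levels of detail.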

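In contrast, the collapsed tree approach flattens the tree into a single layer and retrieves nodes, again ranked by cosine similarity to the query vector, until a token-count threshold is reached. The corresponding code is shown below.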

class TreeRetriever(BaseRetriever):
    ...
    ...
    def retrieve_information_collapse_tree(
        self, query: str, top_k: int, max_tokens: int
    ) -> Tuple[List[Node], str]:
        """
        Retrieves the most relevant information from the tree based on the query.

        Args:
            query (str): The query text.
            top_k (int): The maximum number of nearest nodes to consider.
            max_tokens (int): The maximum number of tokens.

        Returns:
            Tuple[List[Node], str]: The selected nodes and the context built from their text.
        """

        query_embedding = self.create_embedding(query)

        selected_nodes = []

        node_list = get_node_list(self.tree.all_nodes)

        embeddings = get_embeddings(node_list, self.context_embedding_model)

        distances = distances_from_embeddings(query_embedding, embeddings)

        indices = indices_of_nearest_neighbors_from_distances(distances)

        total_tokens = 0
        for idx in indices[:top_k]:

            node = node_list[idx]
            node_tokens = len(self.tokenizer.encode(node.text))

            if total_tokens + node_tokens > max_tokens:
                break

            selected_nodes.append(node)
            total_tokens += node_tokens

        context = get_text(selected_nodes)
        return selected_nodes, context
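
For comparison with tree traversal, here is a self-contained sketch of collapsed-tree retrieval. It treats every node, leaf or summary, as one flat pool of `(text, embedding)` pairs. It assumes whitespace-separated words as tokens in place of a real tokenizer, and `collapsed_tree_retrieve` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def collapsed_tree_retrieve(
    nodes: Sequence[Tuple[str, Sequence[float]]],
    query_emb: Sequence[float],
    top_k: int = 10,
    max_tokens: int = 50,
) -> List[str]:
    """Rank all nodes together and take them in order until the budget is hit."""
    ranked = sorted(
        nodes, key=lambda n: cosine_similarity(query_emb, n[1]), reverse=True
    )
    selected: List[str] = []
    total_tokens = 0
    for text, _ in ranked[:top_k]:
        node_tokens = len(text.split())  # assumption: whitespace tokens
        # Stop at the first node that would exceed the token budget.
        if total_tokens + node_tokens > max_tokens:
            break
        selected.append(text)
        total_tokens += node_tokens
    return selected
```

Because summaries and leaves compete in the same ranking, the query itself decides how abstract the retrieved context is, instead of the fixed per-layer quota used in tree traversal.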