147 Custom Property Graph Retriever: A Deep Dive into LlamaIndex's Advanced Retrieval Flow

Article link: https://blog.csdn.net/xycxycooo/article/details/142517296

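In data retrieval, a property graph is a powerful tool for modeling and querying complex relationships. Sometimes, though, we need finer control over the retrieval process to fit a particular application. This post walks through defining a custom property graph retriever in LlamaIndex, with code and an explanation for each step.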

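Prerequisites

Before getting into custom property graph retrievers, a few basic concepts:

Property graph: a graph data model in which both nodes and edges can carry properties. Nodes represent entities, edges represent relationships between entities, and properties hold additional information about either.
Retriever: a component that fetches relevant information from a data source. In LlamaIndex, a property graph retriever returns nodes or paths from the property graph.
Embedding: a mapping from high-dimensional data such as text to a dense numeric vector in a lower-dimensional space, where similar items lie close together. Embeddings are widely used in machine learning and natural language processing.
Large language model (LLM): a deep-learning model that can understand and generate natural-language text. This example uses OpenAI's gpt-3.5-turbo.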
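
To make the property graph definition concrete, here is a minimal in-memory sketch using only the standard library. The names Node, Edge, and ToyPropertyGraph are hypothetical and are not LlamaIndex classes; the sample entities are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    """An entity in the graph, with free-form properties."""
    id: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed relationship between two nodes."""
    source: str
    target: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


class ToyPropertyGraph:
    """Hypothetical in-memory graph for illustration; not a LlamaIndex class."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        # Reject edges whose endpoints have not been added yet.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def triples_from(self, node_id: str) -> List[Tuple[str, str, str]]:
        """Return (source, relation, target) triples leaving node_id."""
        return [
            (e.source, e.label, e.target)
            for e in self.edges
            if e.source == node_id
        ]


# Illustrative data only.
graph = ToyPropertyGraph()
graph.add_node(Node("paul", "PERSON", {"name": "Paul Graham"}))
graph.add_node(Node("viaweb", "COMPANY", {"name": "Viaweb"}))
graph.add_edge(Edge("paul", "viaweb", "FOUNDED", {"note": "illustrative"}))
print(graph.triples_from("paul"))
```

A real graph store such as Neo4j adds persistence, indexing, and a query language on top of this same node, edge, and property model.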

Installation and Setup

First, install LlamaIndex and its related dependencies, then set the API key and load the sample data:

%pip install llama-index
%pip install llama-index-graph-stores-neo4j
%pip install llama-index-postprocessor-cohere-rerank

import nest_asyncio
nest_asyncio.apply()

import os
os.environ["OPENAI_API_KEY"] = "sk-..."

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

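Code explanation

Install dependencies: the %pip install notebook commands install LlamaIndex, the Neo4j graph store integration, and the Cohere Rerank postprocessor.
Enable nested event loops: nest_asyncio.apply() lets async code run inside a notebook that already has a running event loop.
Set the OpenAI API key: the key is stored in the OPENAI_API_KEY environment variable.
Download the sample data: wget downloads a Paul Graham essay to use as sample data.
Load the data: SimpleDirectoryReader loads the downloaded document.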
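
Hard-coding a key in a notebook is easy to leak. As an alternative sketch, the helper below reads a key that was already exported in the shell and fails early with a clear message if it is missing. The name require_env is hypothetical and not part of LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Usage (assumes the key was exported before starting the notebook):
# api_key = require_env("OPENAI_API_KEY")
```

Failing at startup is usually easier to debug than an authentication error raised midway through index construction.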

Define the Default LLM and Embedding Model

Next, define the default large language model (LLM) and embedding model:

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")

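Code explanation

Import the required classes: OpenAIEmbedding and OpenAI.
Define the LLM: the OpenAI class configures the LLM (gpt-3.5-turbo) with a temperature of 0.3.
Define the embedding model: the OpenAIEmbedding class configures the embedding model (text-embedding-3-small).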
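
The embedding model matters because vector retrieval ranks candidates by how close their vectors are to the query vector. The sketch below shows that ranking step with cosine similarity on hand-made three-dimensional vectors. The vectors and the helper names are assumptions for illustration; real embeddings come from the model and have far more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k candidates most similar to the query, best first."""
    scored = [
        (name, cosine_similarity(query, vec)) for name, vec in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "programming": [0.9, 0.1, 0.0],
    "painting": [0.1, 0.9, 0.0],
    "startups": [0.6, 0.2, 0.6],
}
query_vector = [1.0, 0.0, 0.0]
best = top_k(query_vector, toy_vectors, k=2)
print(best)
```

The k here plays the same role as the similarity_top_k parameter used later in the custom retriever.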

Set Up Neo4j

To launch Neo4j locally, first make sure Docker is installed. Then start the database with the following Docker command:

docker run \
    -p 7474:7474 -p 7687:7687 \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    neo4j:latest

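Once it is running, open http://localhost:7474/ in a browser to access the database. The default username and password are both neo4j. On first login Neo4j asks you to change the password; the code below assumes it was changed to llamaindex.

The following code connects LlamaIndex to Neo4j: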

from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="llamaindex",
    url="bolt://localhost:7687",
)

Code explanation

Import Neo4jPropertyGraphStore: the graph store class for Neo4j.
Create the Neo4j graph store: instantiate Neo4jPropertyGraphStore with the username, password, and Bolt URL.
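
If the container is still starting, creating the graph store fails with a connection error. A small pre-flight check, sketched below with the standard library only, tests whether anything is listening on the Bolt port. The function name bolt_port_reachable is hypothetical, and the check only confirms that the TCP port accepts connections, not that the credentials are valid.

```python
import socket


def bolt_port_reachable(
    host: str = "localhost", port: int = 7687, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if not bolt_port_reachable():
    print("Neo4j does not appear to be listening on bolt://localhost:7687 yet")
```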

Build the Property Graph

Next, build the property graph from the loaded documents:

from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex.from_documents(
    documents,
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    show_progress=True,
)

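Code explanation

Import PropertyGraphIndex: the index class for property graphs.
Build the property graph index: from_documents builds the index from the documents, using the given LLM, embedding model, and graph store. show_progress=True displays progress while the index is built.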

Define a Custom Retriever

Now define a custom retriever by subclassing CustomPGRetriever:

from llama_index.core.retrievers import (
    CustomPGRetriever,
    VectorContextRetriever,
    TextToCypherRetriever,
)
from llama_index.core.graph_stores import PropertyGraphStore
from llama_index.core.vector_stores.types import VectorStore
from llama_index.core.embeddings import BaseEmbedding
from llama_index.core.prompts import PromptTemplate
from llama_index.core.llms import LLM
from llama_index.postprocessor.cohere_rerank import CohereRerank

from typing import Optional, Any, Union

class MyCustomRetriever(CustomPGRetriever):
    """Custom retriever with cohere reranking."""

    def init(
        self,
        ## vector context retriever params
        embed_model: Optional[BaseEmbedding] = None,
        vector_store: Optional[VectorStore] = None,
        similarity_top_k: int = 4,
        path_depth: int = 1,
        ## text-to-cypher params
        llm: Optional[LLM] = None,
        text_to_cypher_template: Optional[Union[PromptTemplate, str]] = None,
        ## cohere reranker params
        cohere_api_key: Optional[str] = None,
        cohere_top_n: int = 2,
        **kwargs: Any,
    ) -> None:
        """Uses any kwargs passed in from class constructor."""

        self.vector_retriever = VectorContextRetriever(
            self.graph_store,
            include_text=self.include_text,
            embed_model=embed_model,
            vector_store=vector_store,
            similarity_top_k=similarity_top_k,
            path_depth=path_depth,
        )

        self.cypher_retriever = TextToCypherRetriever(
            self.graph_store,
            llm=llm,
            text_to_cypher_template=text_to_cypher_template
            ## NOTE: you can attach other parameters here if you'd like
        )

        self.reranker = CohereRerank(
            api_key=cohere_api_key, top_n=cohere_top_n
        )

    def custom_retrieve(self, query_str: str) -> str:
        """Define custom retriever with reranking.

        Could return `str`, `TextNode`, `NodeWithScore`, or a list of those.
        """
        nodes_1 = self.vector_retriever.retrieve(query_str)
        nodes_2 = self.cypher_retriever.retrieve(query_str)
        reranked_nodes = self.reranker.postprocess_nodes(
            nodes_1 + nodes_2, query_str=query_str
        )

        ## TMP: please change
        final_text = "\n\n".join(
            [n.get_content(metadata_mode="llm") for n in reranked_nodes]
        )

        return final_text

    # optional async method
    # async def acustom_retrieve(self, query_str: str) -> str:
    #     ...

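Code explanation

Import the required classes: CustomPGRetriever, VectorContextRetriever, TextToCypherRetriever, CohereRerank, and the supporting types.
Define the custom retriever class: MyCustomRetriever inherits from CustomPGRetriever.
Initialization: the init method is a hook that CustomPGRetriever calls with the extra keyword arguments passed to the constructor, so it is named init rather than __init__. It sets up two existing property graph retrievers (VectorContextRetriever and TextToCypherRetriever) and the Cohere reranker.
Custom retrieval: custom_retrieve runs both retrievers on the query, concatenates their results, reranks them with the Cohere reranker, and joins the top results into a single text string.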
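
The retrieve, merge, and rerank pattern can be tried without any API keys. The sketch below uses only the standard library: the two result lists stand in for the vector and text-to-cypher retrievers, and a simple word-overlap score stands in for the Cohere reranker. All names and sample texts are assumptions for illustration. Unlike the class above, this sketch also removes duplicate results before reranking, which is an extra design choice rather than something the original code does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ScoredText:
    text: str
    score: float = 0.0


def merge_and_rerank(
    result_lists: List[List[ScoredText]],
    query: str,
    score_fn: Callable[[str, str], float],
    top_n: int,
) -> List[ScoredText]:
    """Merge results from several retrievers, drop duplicates, rescore, keep top_n."""
    unique: Dict[str, ScoredText] = {}
    for results in result_lists:
        for item in results:
            # Keep the first occurrence of each distinct text.
            unique.setdefault(item.text, item)
    rescored = [ScoredText(text, score_fn(query, text)) for text in unique]
    rescored.sort(key=lambda s: s.score, reverse=True)
    return rescored[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words found in the text.

    A stand-in for a real reranking model, used here only for illustration.
    """
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Made-up snippets standing in for retriever output.
vector_hits = [
    ScoredText("the author enjoyed programming"),
    ScoredText("the author studied painting"),
]
cypher_hits = [
    ScoredText("the author enjoyed programming"),
    ScoredText("programming on the IBM 1401"),
]
top = merge_and_rerank(
    [vector_hits, cypher_hits],
    "did the author like programming",
    word_overlap,
    top_n=2,
)
print("\n\n".join(item.text for item in top))
```

Swapping word_overlap for a call to a real reranker keeps the rest of the flow unchanged, which is the main reason to keep scoring behind a function parameter.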

Test the Custom Retriever

Now initialize and test the custom retriever:

custom_sub_retriever = MyCustomRetriever(
    index.property_graph_store,
    include_text=True,
    vector_store=index.vector_store,
    cohere_api_key="...",
)

from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    index.as_retriever(sub_retrievers=[custom_sub_retriever]), llm=llm
)

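Code explanation

Initialize the custom retriever: create a MyCustomRetriever with the graph store, the vector store, and the Cohere API key.
Create the query engine: RetrieverQueryEngine.from_args wraps a retriever that uses the custom retriever as its only sub-retriever, together with the LLM.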

Try Some Queries

Finally, run a query and compare the result from the custom retriever with the result from a baseline retriever:

# Query with the custom retriever (vector + text-to-cypher + Cohere rerank)
response = query_engine.query("Did the author like programming?")
print(str(response))

# Baseline for comparison: a plain VectorContextRetriever
base_retriever = VectorContextRetriever(
    index.property_graph_store, include_text=True
)
base_query_engine = index.as_query_engine(sub_retrievers=[base_retriever])

response = base_query_engine.query("Did the author like programming?")
print(str(response))