146 A Deep Dive into the Property Graph Index in LlamaIndex: From Getting Started to Mastery

需要重新演唱

Published 2024-09-25 10:38:14

Article link: https://blog.csdn.net/xycxycooo/article/details/142517167

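In data management and retrieval, a property graph index is a powerful tool for modeling, storing, and querying complex relationships between data entities. This post takes a close look at the Property Graph Index in LlamaIndex, using code examples and technical explanations to help programmers understand and apply it quickly.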

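Prerequisites

Before digging into the Property Graph Index, a few basic concepts are worth reviewing:

Property graph: a graph data model in which both nodes and edges can carry properties. Nodes represent entities, edges represent relationships between entities, and properties hold additional information about either.
Embedding: a mapping from data such as text to a dense numeric vector, usually of much lower dimension than the raw representation, so that similar items end up close together. Embeddings are widely used in machine learning and natural language processing.
Large language model (LLM): a deep-learning model that can understand and generate natural-language text. The examples here use OpenAI's GPT-3.5-turbo model.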
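
To make the property graph definition concrete, here is a minimal sketch using only the Python standard library. The Node, Edge, and ToyPropertyGraph names are hypothetical and exist only for illustration; they are not LlamaIndex classes, and a real graph store does considerably more.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, with a label and free-form properties."""
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labeled relationship between two nodes, with its own properties."""
    source: str
    target: str
    label: str
    properties: dict = field(default_factory=dict)


class ToyPropertyGraph:
    """Hypothetical in-memory property graph (not a LlamaIndex class)."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        # Both endpoints must exist before they can be connected.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def neighbors(self, node_id, label=None):
        """Return nodes reachable over outgoing edges, optionally filtered by edge label."""
        return [
            self.nodes[e.target]
            for e in self.edges
            if e.source == node_id and (label is None or e.label == label)
        ]


graph = ToyPropertyGraph()
graph.add_node(Node("pg", "PERSON", {"name": "Paul Graham"}))
graph.add_node(Node("viaweb", "COMPANY", {"name": "Viaweb"}))
graph.add_edge(Edge("pg", "viaweb", "FOUNDED", {"extracted_from": "chunk-0"}))

founded = [n.properties["name"] for n in graph.neighbors("pg", label="FOUNDED")]
print(founded)
```

Note that the edge carries a property of its own (extracted_from), which is what distinguishes a property graph from a plain labeled graph.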

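Installation and Setup

First, install the LlamaIndex library and set the OpenAI API key. The code below does both, then downloads and loads the sample data: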

%pip install llama-index

import os

os.environ["OPENAI_API_KEY"] = "sk-proj-..."
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

import nest_asyncio
nest_asyncio.apply()

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

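Code walkthrough

Install LlamaIndex: the %pip install llama-index command installs the library (the % and ! prefixes are Jupyter notebook syntax).
Set the OpenAI API key: the key is stored in the OPENAI_API_KEY environment variable; replace the placeholder with your own key.
Download the sample data: wget fetches a Paul Graham essay to use as sample data.
Allow nested event loops: nest_asyncio.apply() lets asynchronous code run inside the notebook's already-running event loop.
Load the data: SimpleDirectoryReader loads the downloaded document.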

Building the Property Graph Index

Next, build a Property Graph Index from the loaded documents:

from llama_index.core import PropertyGraphIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

index = PropertyGraphIndex.from_documents(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    show_progress=True,
)

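Code walkthrough

Import the required modules: PropertyGraphIndex, OpenAIEmbedding, and OpenAI.
Build the index: from_documents builds the Property Graph Index from the documents. The arguments are the documents, the LLM (GPT-3.5-turbo), the embedding model (text-embedding-3-small), and show_progress=True to display progress.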

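Recap of the Index Build

Building the index involves the following steps:

Parse nodes: documents are parsed into nodes.
Extract paths from text: nodes are passed to the LLM, which generates knowledge-graph triples (paths).
Extract implicit paths: each node's relationships property is used to infer implicit paths.
Generate embeddings: embeddings are generated for each text node and each graph node (so this step runs twice).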
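
The four steps above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: Chunk, parse_nodes, stub_extract, extract_implicit_paths, and embed are hypothetical helpers, the LLM call is replaced by a hard-coded pattern matcher, and the embedding is a character-sum stand-in rather than a real model. LlamaIndex's internals differ.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A text node plus the paths attached to it during ingestion."""
    id: str
    text: str
    relationships: dict = field(default_factory=dict)
    paths: list = field(default_factory=list)
    implicit_paths: list = field(default_factory=list)


def parse_nodes(document):
    """Step 1: split the document into chunks and link each chunk to its neighbours."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    chunks = [Chunk(id=f"chunk-{i}", text=s) for i, s in enumerate(sentences)]
    for previous, following in zip(chunks, chunks[1:]):
        previous.relationships["NEXT"] = following.id
        following.relationships["PREVIOUS"] = previous.id
    return chunks


def stub_extract(text):
    """Step 2 stand-in for the LLM call: recognises two hard-coded verb phrases."""
    for phrase, relation in ((" worked at ", "WORKED_AT"), (" founded ", "FOUNDED")):
        if phrase in text:
            subject, obj = text.split(phrase, 1)
            return [(subject, relation, obj)]
    return []


def extract_implicit_paths(chunk):
    """Step 3: derive paths from relationships the chunk already has (no LLM needed)."""
    return [(chunk.id, relation, target) for relation, target in chunk.relationships.items()]


def embed(text, dims=4):
    """Step 4 stand-in: a real index would call an embedding model here."""
    vector = [0.0] * dims
    for position, char in enumerate(text):
        vector[position % dims] += ord(char)
    return vector


document = "Paul Graham worked at Interleaf. Paul Graham founded Viaweb."
chunks = parse_nodes(document)
for chunk in chunks:
    chunk.paths = stub_extract(chunk.text)
    chunk.implicit_paths = extract_implicit_paths(chunk)

# Embedding runs twice: once over text chunks, once over extracted graph entities.
text_embeddings = {chunk.id: embed(chunk.text) for chunk in chunks}
entities = sorted({name for chunk in chunks for s, _, o in chunk.paths for name in (s, o)})
entity_embeddings = {name: embed(name) for name in entities}

print(chunks[0].paths, chunks[0].implicit_paths, entities)
```

Keeping the LLM-extracted paths and the implicit paths in separate lists makes it easy to see which triples cost an LLM call and which come for free from document structure.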

Exploring the Generated Graph

For debugging, the default SimplePropertyGraphStore includes a helper that saves a networkx representation of the graph to an HTML file:

index.property_graph_store.save_networkx_graph(name="./kg.html")

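Code walkthrough

Save the graph: save_networkx_graph writes the graph to an HTML file.
View the graph: open the generated HTML file in a browser to see the graph. Each "dense" node is actually a source chunk, with the extracted entities and relationships branching out from it.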

Custom Low-Level Construction

If needed, the same ingestion can be done with the lower-level API by passing kg_extractors explicitly:

from llama_index.core.indices.property_graph import (
    ImplicitPathExtractor,
    SimpleLLMPathExtractor,
)

index = PropertyGraphIndex.from_documents(
    documents,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    kg_extractors=[
        ImplicitPathExtractor(),
        SimpleLLMPathExtractor(
            llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    show_progress=True,
)

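Code walkthrough

Import the extractors: ImplicitPathExtractor and SimpleLLMPathExtractor.
Custom construction: from_documents is called with an embedding model and a list of knowledge-graph extractors. The LLM is now passed to SimpleLLMPathExtractor instead of to from_documents; num_workers=4 sets how many workers process chunks in parallel, and max_paths_per_chunk=10 caps the number of paths extracted from each chunk.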
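
To illustrate what an LLM-based path extractor has to do with the model's reply, here is a rough sketch of parsing free-form output into triples and enforcing a per-chunk cap. The parse_triples function and the "(subject, relation, object)" line format are assumptions made for this example; the prompt and parser that LlamaIndex actually uses may differ.

```python
import re

# One triple per line, written as "(subject, relation, object)". Assumed format.
TRIPLE_PATTERN = re.compile(r"^\s*\(([^,()]+),([^,()]+),([^,()]+)\)\s*$")


def parse_triples(llm_output, max_paths_per_chunk=10):
    """Parse LLM text into (subject, relation, object) tuples, keeping at most the cap."""
    triples = []
    for line in llm_output.splitlines():
        match = TRIPLE_PATTERN.match(line)
        if match is None:
            continue  # skip chatter and malformed lines
        subject, relation, obj = (part.strip() for part in match.groups())
        if not (subject and relation and obj):
            continue  # skip triples with an empty component
        triples.append((subject, relation, obj))
        if len(triples) >= max_paths_per_chunk:
            break
    return triples


sample_output = """Here are the triples:
(Paul Graham, worked at, Interleaf)
(Paul Graham, founded, Viaweb)
(broken line
(Viaweb, acquired by, Yahoo)
"""

all_triples = parse_triples(sample_output)
capped_triples = parse_triples(sample_output, max_paths_per_chunk=2)
print(all_triples)
print(capped_triples)
```

A cap like this keeps one verbose chunk from flooding the graph with low-value edges, at the cost of possibly dropping valid triples that appear late in the reply.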

Querying the Property Graph Index

Querying a Property Graph Index usually means running one or more sub-retrievers and combining their results:

retriever = index.as_retriever(
    include_text=False,  # whether to include the source text in results; defaults to True
)

nodes = retriever.retrieve("What happened at Interleaf and Viaweb?")

for node in nodes:
    print(node.text)

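Code walkthrough

Create the retriever: as_retriever creates a retriever; include_text=False means only the matched paths are returned, without the source text.
Retrieve nodes: the retriever fetches the nodes relevant to the query.
Print the node text: the loop prints the text of each retrieved node.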
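
The idea of running several sub-retrievers and merging their results can be sketched with a toy example. Everything here is hypothetical: keyword_retrieve and vector_retrieve stand in for real sub-retrievers, the "vectors" are simple bag-of-words counts rather than model embeddings, and the "subject -> relation -> object" output format is an assumption about how a path might be rendered as text.

```python
import math
from collections import Counter

TRIPLES = [
    ("Paul Graham", "worked at", "Interleaf"),
    ("Paul Graham", "founded", "Viaweb"),
    ("Viaweb", "acquired by", "Yahoo"),
    ("Paul Graham", "wrote", "essays"),
]


def as_text(triple):
    return " -> ".join(triple)


def keyword_retrieve(query, triples):
    """Return triples whose subject or object is mentioned in the query."""
    q = query.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def bag_of_words(text):
    return Counter(text.lower().replace("?", " ").replace("->", " ").split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_retrieve(query, triples, top_k=2):
    """Return the top_k triples most similar to the query under the toy vectors."""
    q = bag_of_words(query)
    ranked = sorted(triples, key=lambda t: cosine(q, bag_of_words(as_text(t))), reverse=True)
    return ranked[:top_k]


def retrieve(query, triples, sub_retrievers):
    """Run every sub-retriever and merge the results, dropping duplicates."""
    seen, results = set(), []
    for sub_retriever in sub_retrievers:
        for triple in sub_retriever(query, triples):
            if triple not in seen:
                seen.add(triple)
                results.append(triple)
    return results


query = "What happened at Interleaf and Viaweb?"
results = retrieve(query, TRIPLES, [keyword_retrieve, vector_retrieve])
for triple in results:
    print(as_text(triple))
```

Deduplicating on merge matters because different sub-retrievers often surface the same path by different routes.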

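Storing and Loading the Index

By default, storage uses simple in-memory abstractions: SimpleVectorStore for embeddings and SimplePropertyGraphStore for the property graph. Both can be saved to disk and loaded back: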

index.storage_context.persist(persist_dir="./storage")

from llama_index.core import StorageContext, load_index_from_storage

index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)

Code walkthrough

Persist to disk: persist writes the index to the given directory.
Load from storage: load_index_from_storage loads the index back from that directory.
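
Conceptually, persisting in-memory stores is a serialize-then-deserialize round trip. The sketch below shows that idea with JSON and the standard library only. The persist and load functions and the file names graph_store.json and vector_store.json are invented for this example and are not claimed to match the files LlamaIndex writes.

```python
import json
import tempfile
from pathlib import Path


def persist(triples, embeddings, persist_dir):
    """Write the graph and the embeddings to separate JSON files."""
    directory = Path(persist_dir)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "graph_store.json").write_text(json.dumps({"triples": triples}))
    (directory / "vector_store.json").write_text(json.dumps({"embeddings": embeddings}))


def load(persist_dir):
    """Read both files back and rebuild the in-memory structures."""
    directory = Path(persist_dir)
    graph = json.loads((directory / "graph_store.json").read_text())
    vectors = json.loads((directory / "vector_store.json").read_text())
    # JSON has no tuple type, so triples come back as lists; convert them again.
    triples = [tuple(t) for t in graph["triples"]]
    return triples, vectors["embeddings"]


triples = [("Paul Graham", "founded", "Viaweb")]
embeddings = {"Viaweb": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    storage_dir = Path(tmp) / "storage"
    persist(triples, embeddings, storage_dir)
    restored_triples, restored_embeddings = load(storage_dir)

print(restored_triples, restored_embeddings)
```

Storing the graph and the vectors in separate files mirrors the split between a graph store and a vector store, so either one can be replaced independently.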

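Combining with a Vector Store

Some graph databases, such as Neo4j, support vectors natively, but you can still specify a separate vector store to override the default. The code below combines ChromaVectorStore with the default SimplePropertyGraphStore: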

%pip install llama-index-vector-stores-chroma

from llama_index.core.graph_stores import SimplePropertyGraphStore
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

client = chromadb.PersistentClient("./chroma_db")
collection = client.get_or_create_collection("my_graph_vector_db")

index = PropertyGraphIndex.from_documents(
    documents,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    property_graph_store=SimplePropertyGraphStore(),
    vector_store=ChromaVectorStore(chroma_collection=collection),
    show_progress=True,
)

index.storage_context.persist(persist_dir="./storage")
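
The reason a vector store can be swapped in this way is that the index only depends on a small interface, not on a particular backend. Here is a rough sketch of that design with hypothetical classes (InMemoryVectorStore, RecordingVectorStore, ToyGraphIndex) and a letter-count stand-in for real embeddings; it does not reproduce LlamaIndex's or Chroma's actual interfaces.

```python
import math


def embed(text):
    """Stand-in embedding: letter counts over a-z (a real setup calls an embedding model)."""
    vector = [0.0] * 26
    for char in text.lower():
        if "a" <= char <= "z":
            vector[ord(char) - ord("a")] += 1.0
    return vector


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Default store: keeps every vector in a dict."""

    def __init__(self):
        self.vectors = {}

    def add(self, key, vector):
        self.vectors[key] = vector

    def query(self, vector, top_k=1):
        ranked = sorted(self.vectors, key=lambda k: cosine(vector, self.vectors[k]), reverse=True)
        return ranked[:top_k]


class RecordingVectorStore(InMemoryVectorStore):
    """Override example: same interface, but also records the order of writes."""

    def __init__(self):
        super().__init__()
        self.added = []

    def add(self, key, vector):
        self.added.append(key)
        super().add(key, vector)


class ToyGraphIndex:
    """Hypothetical index that embeds graph entities into whichever store it is given."""

    def __init__(self, triples, vector_store=None):
        self.triples = triples
        # Fall back to the default store only when the caller does not override it.
        self.vector_store = vector_store if vector_store is not None else InMemoryVectorStore()
        for subject, _, obj in triples:
            for entity in (subject, obj):
                if entity not in self.vector_store.vectors:
                    self.vector_store.add(entity, embed(entity))


triples = [("Paul Graham", "founded", "Viaweb"), ("Paul Graham", "worked at", "Interleaf")]
custom_store = RecordingVectorStore()
index = ToyGraphIndex(triples, vector_store=custom_store)

print(custom_store.added)
print(index.vector_store.query(embed("viaweb")))
```

Because the index only calls add and query, any backend that offers those two operations could take the place of the default store.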