31 Using the Property Graph Index: Building and Querying Knowledge Graphs


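In LlamaIndex, the property graph index (PropertyGraphIndex) is a powerful tool for building and querying knowledge graphs. A property graph is a collection of knowledge made up of labeled nodes (for example entity categories or text labels) that carry properties (such as metadata) and are linked by relationships into structured paths.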
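
To make the definition concrete, here is a minimal sketch of a property graph in plain Python. The `Node` and `Edge` dataclasses and the `paths` helper are hypothetical names made up for this illustration; they are not LlamaIndex classes, and the real library stores considerably more information.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str    # unique identifier of the node
    label: str   # category of the node, e.g. "ANIMAL"
    properties: dict = field(default_factory=dict)  # free-form metadata


@dataclass
class Edge:
    label: str   # relationship type, e.g. "HAS"
    source: str  # name of the source node
    target: str  # name of the target node


def paths(nodes: dict[str, Node], edges: list[Edge]) -> list[str]:
    """Render every edge as a readable single-hop path."""
    return [
        f"({nodes[e.source].label}:{e.source}) -[{e.label}]-> "
        f"({nodes[e.target].label}:{e.target})"
        for e in edges
    ]


nodes = {
    "llama": Node("llama", "ANIMAL", {"source": "example"}),
    "index": Node("index", "THING"),
}
edges = [Edge("HAS", "llama", "index")]
print(paths(nodes, edges))
```

Each edge joins two labeled nodes, so a single edge already forms the smallest possible "structured path".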

Usage

Basic Usage

Basic usage comes down to importing the class and calling it:

from llama_index.core import PropertyGraphIndex

# create the index
index = PropertyGraphIndex.from_documents(
    documents,
)

# use the index
retriever = index.as_retriever(
    include_text=True,  # include the source chunk with matching paths
    similarity_top_k=2,  # top k for vector kg node retrieval
)
nodes = retriever.retrieve("Test")

query_engine = index.as_query_engine(
    include_text=True,  # include the source chunk with matching paths
    similarity_top_k=2,  # top k for vector kg node retrieval
)
response = query_engine.query("Test")

# save and load
index.storage_context.persist(persist_dir="./storage")

from llama_index.core import StorageContext, load_index_from_storage

index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)

# load from an existing graph store (and optional vector store)
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store, vector_store=vector_store, ...
)

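Building a Property Graph

In LlamaIndex, a property graph is built by running a series of kg_extractors over each chunk and attaching the extracted entities and relations as metadata to each llama-index node. You can pass as many kg_extractors as you like, and all of them will be applied.

If none are provided, SimpleLLMPathExtractor and ImplicitPathExtractor are used by default.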

index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[extractor1, extractor2, ...],
)

# insert additional documents/nodes
index.insert(document)
index.insert_nodes(nodes)

kg_extractors in Detail

(default) SimpleLLMPathExtractor

Uses an LLM to extract short statements, prompting for and parsing single-hop paths in the format (entity1, relation, entity2):

from llama_index.core.indices.property_graph import SimpleLLMPathExtractor

kg_extractor = SimpleLLMPathExtractor(
    llm=llm,
    max_paths_per_chunk=10,
    num_workers=4,
    show_progress=False,
)
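
The "parse" half of that description can be sketched without any LLM. The snippet below assumes the model answers with one `(entity1, relation, entity2)` triplet per line; the `parse_triplets` function, its regular expression, and the sample output are hypothetical and only illustrate the idea, they are not the parser LlamaIndex ships.

```python
import re

# One triplet per line; the surrounding parentheses are optional.
TRIPLET_RE = re.compile(r"^\s*\(?([^,()]+),([^,()]+),([^,()]+)\)?\s*$")


def parse_triplets(llm_output: str, max_paths: int = 10) -> list[tuple[str, str, str]]:
    """Turn raw LLM text into at most `max_paths` (subject, relation, object) triplets."""
    triplets = []
    for line in llm_output.splitlines():
        match = TRIPLET_RE.match(line)
        if match is None:
            continue  # skip lines that are not shaped like a triplet
        subj, rel, obj = (part.strip() for part in match.groups())
        if subj and rel and obj:  # drop triplets with an empty field
            triplets.append((subj, rel, obj))
        if len(triplets) >= max_paths:
            break
    return triplets


sample = """(Alice, WORKS_AT, Acme)
not a triplet
(Acme, LOCATED_IN, Paris)
(Paris, , France)"""

print(parse_triplets(sample))
```

Malformed lines are skipped rather than raising, which is a reasonable default when the input comes from a model, and `max_paths` plays the same role as a per-chunk cap on extracted paths.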

(default) ImplicitPathExtractor

Extracts paths from the node.relationships attribute of each llama-index node object:

from llama_index.core.indices.property_graph import ImplicitPathExtractor

kg_extractor = ImplicitPathExtractor()
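
As a rough illustration of what "extracting paths from relationships" means, the sketch below turns a mapping of chunk relationships into triplets. The `implicit_paths` function and the plain-dict layout of `chunks` are assumptions made for this example; real llama-index nodes keep their relationships in richer objects.

```python
def implicit_paths(chunks: dict[str, dict[str, object]]) -> list[tuple[str, str, str]]:
    """Derive (chunk, relationship, target) paths from relationships that already exist."""
    paths = []
    for chunk_id, relationships in chunks.items():
        for rel_type, targets in relationships.items():
            # A relationship may point at one node or at several (e.g. children).
            target_ids = targets if isinstance(targets, list) else [targets]
            for target_id in target_ids:
                paths.append((chunk_id, rel_type, target_id))
    return paths


chunks = {
    "chunk-1": {"SOURCE": "doc-1", "NEXT": "chunk-2"},
    "chunk-2": {"SOURCE": "doc-1", "PREVIOUS": "chunk-1"},
}

for path in implicit_paths(chunks):
    print(path)
```

Nothing here calls a model: the paths come purely from structure the chunks already carry.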

DynamicLLMPathExtractor

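Extracts paths (including entity types!) guided by optional lists of allowed entity types and relation types. If none are provided, the LLM assigns types as it sees fit. If they are provided, they help steer the LLM but are not strictly enforced: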

from llama_index.core.indices.property_graph import DynamicLLMPathExtractor

kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=["POLITICIAN", "POLITICAL_PARTY"],
    allowed_relation_types=["PRESIDENT_OF", "MEMBER_OF"],
)

SchemaLLMPathExtractor

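Extracts paths that follow a strict schema of allowed entities, allowed relations, and which relations each entity type can connect to: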

from typing import Literal
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

entities = Literal["PERSON", "PLACE", "THING"]
relations = Literal["PART_OF", "HAS", "IS_A"]
schema = {
    "PERSON": ["PART_OF", "HAS", "IS_A"],
    "PLACE": ["PART_OF", "HAS"],
    "THING": ["IS_A"],
}

kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=schema,
    strict=True,  # if False, triplets outside the schema are allowed
    num_workers=4,
    max_paths_per_chunk=10,
    show_progress=False,
)
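
The effect of `strict` can be sketched in plain Python. The snippet assumes that a triplet is valid when both entity types are allowed, the relation is allowed, and the relation is listed under the subject's entity type; that rule, the `filter_triplets` function, and the sample triplets are assumptions for illustration and may differ from the checks the library actually performs.

```python
Entity = tuple[str, str]              # (name, entity type)
Triplet = tuple[Entity, str, Entity]  # (subject, relation, object)

ENTITY_TYPES = {"PERSON", "PLACE", "THING"}
RELATION_TYPES = {"PART_OF", "HAS", "IS_A"}
SCHEMA = {
    "PERSON": ["PART_OF", "HAS", "IS_A"],
    "PLACE": ["PART_OF", "HAS"],
    "THING": ["IS_A"],
}


def filter_triplets(triplets: list[Triplet], strict: bool = True) -> list[Triplet]:
    """Keep only schema-conforming triplets when strict, otherwise keep everything."""
    if not strict:
        return list(triplets)
    kept = []
    for subj, rel, obj in triplets:
        types_ok = subj[1] in ENTITY_TYPES and obj[1] in ENTITY_TYPES
        rel_ok = rel in RELATION_TYPES and rel in SCHEMA.get(subj[1], [])
        if types_ok and rel_ok:
            kept.append((subj, rel, obj))
    return kept


candidates = [
    (("Alice", "PERSON"), "PART_OF", ("Acme", "THING")),  # allowed
    (("Paris", "PLACE"), "IS_A", ("City", "THING")),      # PLACE may not use IS_A
    (("Bob", "ROBOT"), "HAS", ("Arm", "THING")),          # ROBOT is not an allowed type
]

print(filter_triplets(candidates, strict=True))
print(len(filter_triplets(candidates, strict=False)))
```

With `strict=True` only the first candidate survives; with `strict=False` all three are kept.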

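Retrieval and Querying

A labeled property graph can be queried in several ways to retrieve nodes and paths. In LlamaIndex, several node retrieval methods can be combined at once!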

# create a retriever
retriever = index.as_retriever(sub_retrievers=[retriever1, retriever2, ...])

# create a query engine
query_engine = index.as_query_engine(
    sub_retrievers=[retriever1, retriever2, ...]
)

If no sub-retrievers are provided, the defaults are LLMSynonymRetriever and VectorContextRetriever (if embeddings are enabled).

All retrievers currently available are:

LLMSynonymRetriever
VectorContextRetriever
TextToCypherRetriever
CypherTemplateRetriever
CustomPGRetriever

Typically, you define one or more of these sub-retrievers and pass them to PGRetriever:

from llama_index.core.indices.property_graph import (
    PGRetriever,
    VectorContextRetriever,
    LLMSynonymRetriever,
)

sub_retrievers = [
    VectorContextRetriever(index.property_graph_store, ...),
    LLMSynonymRetriever(index.property_graph_store, ...),
]

retriever = PGRetriever(sub_retrievers=sub_retrievers)

nodes = retriever.retrieve("<query>")
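
One way to picture how results from several sub-retrievers might be merged is sketched below. This is not the PGRetriever implementation: the `combine` function, the two toy retrievers, and the choice to keep the best score per result are all assumptions made for illustration.

```python
from typing import Callable

Result = tuple[str, float]  # (retrieved text, score)
SubRetriever = Callable[[str], list[Result]]


def combine(query: str, sub_retrievers: list[SubRetriever]) -> list[Result]:
    """Run every sub-retriever, drop duplicate texts, and sort by best score."""
    best: dict[str, float] = {}
    for retrieve in sub_retrievers:
        for text, score in retrieve(query):
            if text not in best or score > best[text]:
                best[text] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


def keyword_retriever(query: str) -> list[Result]:
    # Toy stand-in for a synonym/keyword based retriever.
    return [("llama -> HAS -> index", 0.5)]


def vector_retriever(query: str) -> list[Result]:
    # Toy stand-in for a vector-similarity based retriever.
    return [("llama -> HAS -> index", 0.9), ("index -> IS_A -> thing", 0.7)]


print(combine("llama", [keyword_retriever, vector_retriever]))
```

The path found by both toy retrievers appears once in the merged list, with the higher of its two scores.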

Storage

Currently, the graph stores that support property graphs are:

Graph store	In-memory	Native embedding support	Server or disk based
SimplePropertyGraphStore	✅	❌	Disk
Neo4jPropertyGraphStore	❌	✅	Server
NebulaPropertyGraphStore	❌	❌	Server
TiDBPropertyGraphStore	❌	✅	Server

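Saving to and Loading from Disk

The default property graph store, SimplePropertyGraphStore, keeps everything in memory and persists to and loads from disk.

Here is an example of saving and loading an index: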

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.indices import PropertyGraphIndex

# create the index
index = PropertyGraphIndex.from_documents(documents)

# save
index.storage_context.persist("./storage")

# load
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
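
The "in memory, persisted to disk" pattern can be illustrated with a toy store that round-trips through a JSON file. `TinyGraphStore`, its methods, and the file layout are hypothetical and chosen only for this sketch; SimplePropertyGraphStore has its own format and API.

```python
import json
import tempfile
from pathlib import Path


class TinyGraphStore:
    """Toy in-memory triplet store that can be written to and read from disk."""

    def __init__(self) -> None:
        self.triplets: list[list[str]] = []

    def add(self, subj: str, rel: str, obj: str) -> None:
        self.triplets.append([subj, rel, obj])

    def persist(self, path: Path) -> None:
        # Everything lives in memory until this call writes it out.
        path.write_text(json.dumps({"triplets": self.triplets}))

    @classmethod
    def from_persist_path(cls, path: Path) -> "TinyGraphStore":
        store = cls()
        store.triplets = json.loads(path.read_text())["triplets"]
        return store


store = TinyGraphStore()
store.add("llama", "HAS", "index")

with tempfile.TemporaryDirectory() as tmp_dir:
    persist_path = Path(tmp_dir) / "graph_store.json"
    store.persist(persist_path)
    restored = TinyGraphStore.from_persist_path(persist_path)

print(restored.triplets)
```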

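Saving and Loading with Integrations

Integrations typically save automatically. Some graph stores support vectors and others do not. You can always combine a graph store with an external vector database.

The following example shows how to save and load a property graph index using Neo4j and Qdrant: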

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.indices import PropertyGraphIndex
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, AsyncQdrantClient

vector_store = QdrantVectorStore(
    "graph_collection",
    client=QdrantClient(...),
    aclient=AsyncQdrantClient(...),
)

graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="<password>",
    url="bolt://localhost:7687",
)

# create the index
index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
    # optional, neo4j also supports vectors directly
    vector_store=vector_store,
    embed_kg_nodes=True,
)

# load from the existing graph/vector store
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    # optional, neo4j also supports vectors directly
    vector_store=vector_store,
    embed_kg_nodes=True,
)

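Using a Property Graph Store Directly

The base storage class for property graphs is PropertyGraphStore. These stores are built from different types of LabeledNode objects, connected by Relation objects.

We can create these objects ourselves and insert them directly!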

from llama_index.core.graph_stores import (
    SimplePropertyGraphStore,
    EntityNode,
    Relation,
)
from llama_index.core.schema import TextNode

graph_store = SimplePropertyGraphStore()

entities = [
    EntityNode(name="llama", label="ANIMAL", properties={"key": "val"}),
    EntityNode(name="index", label="THING", properties={"key": "val"}),
]

relations = [
    Relation(
        label="HAS",
        source_id=entities[0].id,
        target_id=entities[1].id,
        properties={},
    )
]

graph_store.upsert_nodes(entities)
graph_store.upsert_relations(relations)

# optionally, we can also insert text chunks
source_chunk = TextNode(id_="source", text="My llama has an index.")

# create a relation for each entity
source_relations = [
    Relation(
        label="HAS_SOURCE",
        source_id=entities[0].id,
        target_id="source",
    ),
    Relation(
        label="HAS_SOURCE",
        source_id=entities[1].id,
        target_id="source",
    ),
]
graph_store.upsert_llama_nodes([source_chunk])
graph_store.upsert_relations(source_relations)

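Other useful methods on the graph store include:

graph_store.get(ids=[]) - get nodes by ID
graph_store.get(properties={"key": "val"}) - get nodes by matching properties
graph_store.get_rel_map([entity_node], depth=2) - get triplets up to a given depth
graph_store.get_llama_nodes(['id1']) - get the original text nodes
graph_store.delete(ids=['id1']) - delete by ID
graph_store.delete(properties={"key": "val"}) - delete by properties
graph_store.structured_query("<cypher query>") - run a Cypher query (assuming the graph store supports it)

In addition, async versions of all these methods exist (aget, adelete, and so on).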
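
The idea behind fetching triplets "up to a given depth" can be sketched as a breadth-first walk over a list of triplets. The `rel_map` function below is a hypothetical stand-in, not the library's get_rel_map: it assumes only outgoing edges are followed and that each triplet is reported once.

```python
def rel_map(
    triplets: list[tuple[str, str, str]], start: str, depth: int = 2
) -> list[tuple[str, str, str]]:
    """Collect triplets reachable from `start` within `depth` hops (outgoing edges only)."""
    found = []
    seen = set()
    visited = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for triplet in triplets:
            subj, _rel, obj = triplet
            if subj in frontier and triplet not in seen:
                seen.add(triplet)
                found.append(triplet)
                if obj not in visited:  # expand each node only once
                    visited.add(obj)
                    next_frontier.add(obj)
        frontier = next_frontier
    return found


graph = [
    ("llama", "HAS", "index"),
    ("index", "IS_A", "thing"),
    ("thing", "PART_OF", "world"),
]

print(rel_map(graph, "llama", depth=1))
print(rel_map(graph, "llama", depth=2))
```

Raising `depth` by one adds the triplets that start at the nodes reached in the previous hop.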

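Advanced Customization

As with all components in LlamaIndex, you can subclass modules and customize them so they work exactly the way you need, or to try out new ideas and research new modules!

Subclassing Extractors

Graph extractors in LlamaIndex subclass the TransformComponent class. If you have worked with the ingestion pipeline before, this will look familiar, since it is the same class.

An extractor is required to insert graph data into the metadata of each node, which the index then processes.

Here is an example of subclassing to create a custom extractor: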

from llama_index.core.graph_stores.types import (
    EntityNode,
    Relation,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
)
from llama_index.core.schema import BaseNode, TransformComponent

class MyGraphExtractor(TransformComponent):
    # an __init__ is optional
    # def __init__(self, ...):
    #     ...

    def __call__(
        self, llama_nodes: list[BaseNode], **kwargs
    ) -> list[BaseNode]:
        for llama_node in llama_nodes:
            # make sure existing entities/relations are not overwritten

            existing_nodes = llama_node.metadata.pop(KG_NODES_KEY, [])
            existing_relations = llama_node.metadata.pop(KG_RELATIONS_KEY, [])

            existing_nodes.append(
                EntityNode(
                    name="llama", label="ANIMAL", properties={"key": "val"}
                )
            )
            existing_nodes.append(
                EntityNode(
                    name="index", label="THING", properties={"key": "val"}
                )
            )

            existing_relations.append(
                Relation(
                    label="HAS",
                    source_id="llama",
                    target_id="index",
                    properties={},
                )
            )

            # add them back to the metadata

            llama_node.metadata[KG_NODES_KEY] = existing_nodes
            llama_node.metadata[KG_RELATIONS_KEY] = existing_relations

        return llama_nodes

    # optional async method
    # async def acall(self, llama_nodes: list[BaseNode], **kwargs) -> list[BaseNode]:
    #    ...

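Subclassing Retrievers

Retrievers are a bit more complicated than extractors, and have their own dedicated class to make subclassing easier.

The return type of retrieval is very flexible. It can be:

a string
a TextNode
a NodeWithScore
a list of any of the above

Here is an example of subclassing to create a custom retriever: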

from llama_index.core.indices.property_graph import (
    CustomPGRetriever,
    CUSTOM_RETRIEVE_TYPE,
)

class MyCustomRetriever(CustomPGRetriever):
    def init(self, my_option_1: bool = False, **kwargs) -> None:
        """Uses any kwargs passed in from the class constructor."""
        self.my_option_1 = my_option_1
        # optionally do something with self.graph_store

    def custom_retrieve(self, query_str: str) -> CUSTOM_RETRIEVE_TYPE:
        # do something with self.graph_store
        return "result"

    # optional async method
    # async def acustom_retrieve(self, query_str: str) -> str:
    #     ...

custom_retriever = MyCustomRetriever(graph_store, my_option_1=True)

retriever = index.as_retriever(sub_retrievers=[custom_retriever])