159 Comparing LLM Path Extractors: Three Approaches to Building a Knowledge Graph

Author: 需要重新演唱

Published 2024-09-26 14:00:41

https://docs.llamaindex.ai/en/stable/examples/property_graph/Dynamic_KG_Extraction/#comparison-and-analysis

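In modern data science and AI, the knowledge graph has become an important tool for handling complex information. A knowledge graph represents entities and the relations between them in structured form, which makes information easier to retrieve and reason over. This post looks at three LLM path extractors in LlamaIndex, SimpleLLMPathExtractor, SchemaLLMPathExtractor, and DynamicLLMPathExtractor, and compares how each behaves when building a knowledge graph.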
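
To make "entities and their relations" concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triplets in plain Python. The `Triplet` type and the `neighbors` helper are hypothetical names used only for illustration and are not part of LlamaIndex; the sample facts are hand-written, not extracted.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """One edge of a knowledge graph: subject --relation--> object."""
    subject: str
    relation: str
    obj: str


# Hand-written facts for illustration; a real graph would be extracted from text.
triplets = [
    Triplet("Barack Obama", "MEMBER_OF", "Democratic Party"),
    Triplet("Barack Obama", "BORN_IN", "Honolulu"),
    Triplet("Honolulu", "LOCATED_IN", "Hawaii"),
]


def neighbors(edges, entity):
    """Return (relation, object) pairs for edges that start at `entity`."""
    index = defaultdict(list)
    for t in edges:
        index[t.subject].append((t.relation, t.obj))
    return index[entity]


obama_edges = neighbors(triplets, "Barack Obama")
print(obama_edges)
```

The extractors discussed below all produce data of roughly this shape; they differ in how much they constrain the entity and relation labels.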

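Prerequisites

Before starting, make sure you have the following:

Python basics: familiarity with Python programming.
OpenAI API key: required to call the OpenAI models used here.
LlamaIndex: install the library with pip install llama-index.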

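Environment Setup

First, install the required packages and configure the OpenAI API key.

# Install LlamaIndex plus pyvis (graph visualization) and wikipedia (source text)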
%pip install llama_index pyvis wikipedia

# Set the OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "sk-..."

# Configure logging
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Import the required libraries
from llama_index.core import Document, PropertyGraphIndex
from llama_index.core.indices.property_graph import (
    SimpleLLMPathExtractor,
    SchemaLLMPathExtractor,
    DynamicLLMPathExtractor,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

import wikipedia
import nest_asyncio

nest_asyncio.apply()

Setting Up the LLM Backend

# Configure the LLM
llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")

# Apply global settings
Settings.llm = llm
Settings.chunk_size = 2048
Settings.chunk_overlap = 20
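
The `chunk_size` and `chunk_overlap` settings control how the document is split before extraction. The sketch below illustrates the idea with a simplified character-based sliding window. This is an assumption made for illustration: LlamaIndex's default splitter works on tokens and sentence boundaries rather than raw characters, and `sliding_chunks` is a hypothetical helper, not a library function.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters, where
    consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij"  # 10 characters
chunks = sliding_chunks(sample, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each extractor below is called once per chunk, so larger chunks mean fewer LLM calls but more text for the model to cover in each call.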

Fetching Raw Text from Wikipedia

def get_wikipedia_content(title):
    try:
        page = wikipedia.page(title)
        return page.content
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Disambiguation page. Options: {e.options}")
    except wikipedia.exceptions.PageError:
        print(f"Page '{title}' does not exist.")
    return None

wiki_title = "Barack Obama"
content = get_wikipedia_content(wiki_title)

if content:
    document = Document(text=content, metadata={"title": wiki_title})
    print(
        f"Fetched content for '{wiki_title}' (length: {len(content)} characters)"
    )
else:
    print("Failed to fetch Wikipedia content.")

1. SimpleLLMPathExtractor

SimpleLLMPathExtractor is the most basic path extractor: it uses no predefined schema and extracts triplets directly from the text.

kg_extractor = SimpleLLMPathExtractor(
    llm=llm, max_paths_per_chunk=20, num_workers=4
)

simple_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

simple_index.property_graph_store.save_networkx_graph(
    name="./SimpleGraph.html"
)

simple_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]

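Analysis

The knowledge graph produced by SimpleLLMPathExtractor may contain a large and varied set of relations, but without a predefined schema the naming of entities and relations can be inconsistent.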
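
One way to gauge this kind of naming inconsistency is to count distinct relation labels before and after a simple normalization. This is a minimal sketch: the `normalize_label` helper and the sample labels are hypothetical and were not produced by the run above.

```python
import re
from collections import Counter


def normalize_label(label):
    """Upper-case a label and collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", label.strip()).strip("_").upper()


# Hypothetical relation labels, as a schema-free extractor might emit them.
raw_labels = ["was born in", "Born in", "born_in", "member of", "Member-Of"]

raw_counts = Counter(raw_labels)
normalized_counts = Counter(normalize_label(label) for label in raw_labels)

print(len(raw_counts), "distinct raw labels")
print(len(normalized_counts), "distinct normalized labels")
```

Normalization only merges spelling variants; labels that differ in wording, such as "was born in" versus "born in", remain separate, which is the gap a schema is meant to close.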

2. DynamicLLMPathExtractor

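DynamicLLMPathExtractor combines the flexibility of SimpleLLMPathExtractor with optional guidance from an initial schema. It can extend the set of entity and relation types as needed while still keeping a degree of consistency.

Without an Initial Schema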

kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=None,
    allowed_relation_types=None,
    allowed_relation_props=[],
    allowed_entity_props=[],
)

dynamic_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

dynamic_index.property_graph_store.save_networkx_graph(
    name="./DynamicGraph.html"
)

dynamic_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]

With an Initial Schema

kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=["POLITICIAN", "POLITICAL_PARTY"],
    allowed_relation_types=["PRESIDENT_OF", "MEMBER_OF"],
    allowed_relation_props=["description"],
    allowed_entity_props=["description"],
)

dynamic_index_2 = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

dynamic_index_2.property_graph_store.save_networkx_graph(
    name="./DynamicGraph_2.html"
)

dynamic_index_2.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]

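Analysis

The knowledge graph produced by DynamicLLMPathExtractor strikes a balance between diversity and consistency. It can capture important relations that a purely schema-based approach might miss.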
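
The idea of starting from a small schema and letting it grow can be sketched in plain Python. This illustrates the concept only and is not how DynamicLLMPathExtractor is implemented; `expand_types` and the sample triplets are hypothetical.

```python
def expand_types(initial_types, extracted):
    """Return the initial relation types followed by any newly seen types,
    preserving first-seen order."""
    known = list(initial_types)
    seen = set(known)
    for _subject, relation, _obj in extracted:
        if relation not in seen:
            seen.add(relation)
            known.append(relation)
    return known


initial_relations = ["PRESIDENT_OF", "MEMBER_OF"]

# Hypothetical extractor output, written by hand for illustration.
extracted = [
    ("Barack Obama", "PRESIDENT_OF", "United States"),
    ("Barack Obama", "MARRIED_TO", "Michelle Obama"),
    ("Barack Obama", "MEMBER_OF", "Democratic Party"),
    ("Barack Obama", "MARRIED_TO", "Michelle Obama"),
]

relation_types = expand_types(initial_relations, extracted)
print(relation_types)
```

The initial types act as a starting vocabulary rather than a hard limit, which is the behavior the section above describes.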

3. SchemaLLMPathExtractor

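SchemaLLMPathExtractor extracts triplets according to a predefined schema. The resulting knowledge graph is more structurally consistent, but relations that do not fit the schema may be missed.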

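kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    strict=False,  # do not drop triplets that fall outside the schema
    possible_entities=None,  # None: fall back to the library's default entity types
    possible_relations=None,  # None: fall back to the library's default relation types
    possible_relation_props=["extra_description"],
    possible_entity_props=["extra_description"],
    num_workers=4,
)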

schema_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

schema_index.property_graph_store.save_networkx_graph(
    name="./SchemaGraph.html"
)

schema_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]

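Analysis

The knowledge graph produced by SchemaLLMPathExtractor is more structurally consistent, but because it relies on a predefined schema it may miss some important relations.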
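
This trade-off can be illustrated with a small filter that keeps only triplets whose relation appears in a predefined schema. It is a conceptual sketch, not SchemaLLMPathExtractor's internals; `filter_by_schema` and the sample data are hypothetical.

```python
def filter_by_schema(triplets, allowed_relations):
    """Split triplets into those that match the schema and those that do not."""
    allowed = set(allowed_relations)
    kept, dropped = [], []
    for triplet in triplets:
        (kept if triplet[1] in allowed else dropped).append(triplet)
    return kept, dropped


schema_relations = ["PRESIDENT_OF", "MEMBER_OF"]

# Hypothetical candidate triplets, written by hand for illustration.
candidates = [
    ("Barack Obama", "PRESIDENT_OF", "United States"),
    ("Barack Obama", "AUTHOR_OF", "Dreams from My Father"),
    ("Barack Obama", "MEMBER_OF", "Democratic Party"),
]

kept, dropped = filter_by_schema(candidates, schema_relations)
print("kept:", kept)
print("dropped:", dropped)
```

The dropped list is where a strict schema loses information: the AUTHOR_OF fact is valid but has no place in the two-relation schema.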

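Comparison and Analysis

Key Observations

SimpleLLMPathExtractor: likely yields the most varied set of entities and relations, but without a predefined schema the naming of entities and relations can be inconsistent.
SchemaLLMPathExtractor: should yield the most consistent graph, but because it relies on a predefined schema it may miss some important relations.
DynamicLLMPathExtractor: balances diversity and consistency, capturing important relations while still keeping some structure.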
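
To put observations like these on a numeric footing, one could summarize each graph's triplets with a few counts. This is a minimal sketch, assuming the triplets have already been converted to plain (subject, relation, object) string tuples; `summarize` is a hypothetical helper and the sample data is made up, not output from the runs above.

```python
def summarize(triplets):
    """Count triplets, distinct entities, and distinct relation labels."""
    entities = {t[0] for t in triplets} | {t[2] for t in triplets}
    relations = {t[1] for t in triplets}
    return {
        "triplets": len(triplets),
        "entities": len(entities),
        "relations": len(relations),
    }


# Made-up triplets standing in for two extractors' output.
simple_like = [
    ("Obama", "was born in", "Honolulu"),
    ("Barack Obama", "Born in", "Honolulu"),
    ("Barack Obama", "served as", "President"),
]
schema_like = [
    ("Barack Obama", "BORN_IN", "Honolulu"),
    ("Barack Obama", "HAS_ROLE", "President"),
]

for name, data in [("simple", simple_like), ("schema", schema_like)]:
    print(name, summarize(data))
```

A high ratio of distinct relation labels to triplets hints at inconsistent naming, while a very low one may indicate that the schema is filtering out detail.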

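Choosing an Extractor

SimpleLLMPathExtractor: suited to exploratory analysis, where you want to capture a broad range of potential relations and do not care about entity types.
SchemaLLMPathExtractor: suited to well-defined domains, where you want the extracted knowledge to be consistent.
DynamicLLMPathExtractor: suited to cases that need a balance between structure and flexibility, letting the model discover new entity and relation types while still giving it some initial guidance.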
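
The guidance above can be condensed into a small decision helper. This is a deliberate simplification (for example, DynamicLLMPathExtractor can also run with no initial schema, as shown earlier); `choose_extractor` and its two parameters are hypothetical and only return the class name as a string.

```python
def choose_extractor(has_fixed_schema, allow_new_types):
    """Map two yes/no questions to an extractor class name."""
    if not has_fixed_schema:
        # No schema at all: broad, exploratory extraction.
        return "SimpleLLMPathExtractor"
    if allow_new_types:
        # Some initial guidance, but the model may add new types.
        return "DynamicLLMPathExtractor"
    # Well-defined domain where consistency matters most.
    return "SchemaLLMPathExtractor"


choice = choose_extractor(has_fixed_schema=True, allow_new_types=True)
print(choice)
```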

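Summary

LlamaIndex's three LLM path extractors let you build knowledge graphs to suit different needs. SimpleLLMPathExtractor fits exploratory analysis, SchemaLLMPathExtractor fits well-defined domains, and DynamicLLMPathExtractor balances diversity and consistency. Hopefully this post helps you better understand and apply knowledge graph techniques.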