DynamicLLMPathExtractor in LlamaIndex: A Flexible Approach to Building Knowledge Graphs

Article link: https://blog.csdn.net/xycxycooo/article/details/142554351

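In modern data science and AI, knowledge graphs have become an important tool for handling complex information. A knowledge graph represents entities and their relationships in structured form, which makes retrieval and understanding more efficient. This article takes a close look at DynamicLLMPathExtractor in LlamaIndex to explain how it works and how to use it in practice.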

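Prerequisites

Before starting, make sure you have the following:

Python basics: familiarity with Python programming.
OpenAI API key: you need an OpenAI API key to use OpenAI models.
LlamaIndex: install the library with pip install llama-index.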

Environment Setup

First, install the required packages and configure the OpenAI API key.

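# Install LlamaIndex and the helper packages
# (%pip is a Jupyter magic; in a shell, run pip install with the same package names)
%pip install llama_index pyvis wikipedia

# Set the OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "sk-..."

# Configure logging so that progress messages go to stdout
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Import the required libraries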
from llama_index.core import Document, PropertyGraphIndex
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

import wikipedia
import nest_asyncio

nest_asyncio.apply()

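Setting Up the LLM Backend

# Create the LLM (temperature 0.0 keeps the extraction as repeatable as possible)
llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")

# Apply global settings
Settings.llm = llm
Settings.chunk_size = 2048
Settings.chunk_overlap = 20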
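To make the two chunking settings concrete, here is a minimal sketch of how a chunk size of 2048 and an overlap of 20 divide a long token sequence. It is a plain sliding window written for illustration; the function name sliding_chunks is hypothetical, and LlamaIndex's own splitter also tries to respect sentence boundaries, so real chunk lengths will differ.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. This is a simplified
    illustration, not LlamaIndex's actual splitting logic.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# 5000 placeholder tokens standing in for a tokenized document.
tokens = [f"t{i}" for i in range(5000)]
chunks = sliding_chunks(tokens, chunk_size=2048, chunk_overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk is sent to the LLM separately, so a larger chunk size means fewer LLM calls per document, while the overlap reduces the chance that a relationship is lost at a chunk boundary.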

Fetching Raw Text from Wikipedia

def get_wikipedia_content(title):
    try:
        page = wikipedia.page(title)
        return page.content
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Disambiguation page. Options: {e.options}")
    except wikipedia.exceptions.PageError:
        print(f"Page '{title}' does not exist.")
    return None

wiki_title = "Barack Obama"
content = get_wikipedia_content(wiki_title)

if content:
    document = Document(text=content, metadata={"title": wiki_title})
    print(
        f"Fetched content for '{wiki_title}' (length: {len(content)} characters)"
    )
else:
    # Stop here: the later cells need `document`, which is only defined on success.
    raise RuntimeError("Failed to fetch Wikipedia content.")

DynamicLLMPathExtractor

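DynamicLLMPathExtractor is a flexible path extractor that combines the freedom of SimpleLLMPathExtractor with some of the initial schema guidance of SchemaLLMPathExtractor. It can expand the set of entity and relation types as needed while still keeping the labels reasonably consistent.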
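The idea of a schema that starts from optional seed types and grows as extraction proceeds can be sketched in plain Python. This is only a conceptual illustration, not the library's implementation: the function grow_schema, the five-field triple layout, and the sample triples are all hypothetical and written by hand.

```python
from typing import Iterable, List, Optional, Tuple

# (head, head_type, relation, tail, tail_type); a hypothetical layout for this sketch.
Triple = Tuple[str, str, str, str, str]


def grow_schema(
    triples: Iterable[Triple],
    seed_entity_types: Optional[List[str]] = None,
    seed_relation_types: Optional[List[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Return the entity and relation types that were not in the seed schema.

    Seed types act as hints only: unseen types are accepted and recorded
    instead of being rejected.
    """
    entity_types = set(seed_entity_types or [])
    relation_types = set(seed_relation_types or [])
    new_entity_types: List[str] = []
    new_relation_types: List[str] = []
    for _head, head_type, relation, _tail, tail_type in triples:
        for entity_type in (head_type, tail_type):
            if entity_type not in entity_types:
                entity_types.add(entity_type)
                new_entity_types.append(entity_type)
        if relation not in relation_types:
            relation_types.add(relation)
            new_relation_types.append(relation)
    return new_entity_types, new_relation_types


# Hand-written triples for illustration, not real extractor output.
sample = [
    ("Barack Obama", "POLITICIAN", "MEMBER_OF", "Democratic Party", "POLITICAL_PARTY"),
    ("Barack Obama", "POLITICIAN", "BORN_IN", "Honolulu", "CITY"),
]

seeded = grow_schema(sample, ["POLITICIAN", "POLITICAL_PARTY"], ["PRESIDENT_OF", "MEMBER_OF"])
unseeded = grow_schema(sample)
print(seeded)
print(unseeded)
```

With seed types, only CITY and BORN_IN count as additions; with no seeds, every type is new. That mirrors the two modes shown in the sections that follow.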

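Without an Initial Schema

Without an initial schema, DynamicLLMPathExtractor lets the LLM infer the schema entirely on its own and label entities and relations by its best judgment.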

kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=None,
    allowed_relation_types=None,
    allowed_relation_props=[],
    allowed_entity_props=[],
)

dynamic_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

dynamic_index.property_graph_store.save_networkx_graph(
    name="./DynamicGraph.html"
)

dynamic_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
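The raw triplet list is hard to read in a notebook, so a small formatter can help. The sketch below assumes each triplet is a (head, relation, tail) tuple whose items expose name and label attributes; that matches my understanding of LlamaIndex's entity and relation objects but should be checked against your installed version. StubNode and StubRelation are hypothetical stand-ins so the example runs without an index.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


@dataclass
class StubNode:
    """Hypothetical stand-in for an entity node."""
    name: str
    label: str


@dataclass
class StubRelation:
    """Hypothetical stand-in for a relation."""
    label: str


def _describe(node: Any) -> str:
    # Fall back to str(node) / "?" when an attribute is missing.
    name = getattr(node, "name", str(node))
    label = getattr(node, "label", "?")
    return f"{name}:{label}"


def format_triplets(triplets: Iterable[Tuple[Any, Any, Any]]) -> List[str]:
    """Render (head, relation, tail) triplets as one readable line each."""
    lines = []
    for head, relation, tail in triplets:
        relation_label = getattr(relation, "label", "?")
        lines.append(f"({_describe(head)}) -[{relation_label}]-> ({_describe(tail)})")
    return lines


# Hand-written example triplet, not real extractor output.
demo = [
    (
        StubNode("Barack Obama", "POLITICIAN"),
        StubRelation("MEMBER_OF"),
        StubNode("Democratic Party", "POLITICAL_PARTY"),
    )
]
formatted = format_triplets(demo)
for line in formatted:
    print(line)
```

With a real index, the same function could be called on the list returned by get_triplets, provided the attribute names assumed above hold.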

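With an Initial Schema

With an initial schema, you pass DynamicLLMPathExtractor a starting set of entity and relation types to guide the LLM's labeling decisions. This does not guarantee that the LLM will use those types, but it gives the model a starting point that it can extend as needed.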

kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=["POLITICIAN", "POLITICAL_PARTY"],
    allowed_relation_types=["PRESIDENT_OF", "MEMBER_OF"],
    allowed_relation_props=["description"],
    allowed_entity_props=["description"],
)

dynamic_index_2 = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

dynamic_index_2.property_graph_store.save_networkx_graph(
    name="./DynamicGraph_2.html"
)

dynamic_index_2.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
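Because this configuration also allows a description property on entities and relations, it can be useful to pull those descriptions out of the returned triplets. The helper below is a sketch under the assumption that each node exposes a name attribute and a properties dictionary; collect_descriptions and the SimpleNamespace stand-ins are hypothetical, and the sample description is written by hand.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Tuple


def collect_descriptions(
    triplets: Iterable[Tuple[Any, Any, Any]], prop: str = "description"
) -> Dict[str, Any]:
    """Map each entity name to the value of `prop`, when one is present."""
    found: Dict[str, Any] = {}
    for head, _relation, tail in triplets:
        for node in (head, tail):
            name = getattr(node, "name", None)
            properties = getattr(node, "properties", None) or {}
            if name and prop in properties:
                found[name] = properties[prop]
    return found


# Stand-in objects so the sketch runs without an index; not real extractor output.
head = SimpleNamespace(
    name="Barack Obama",
    properties={"description": "44th president of the United States"},
)
relation = SimpleNamespace(label="MEMBER_OF", properties={})
tail = SimpleNamespace(name="Democratic Party", properties={})

descriptions = collect_descriptions([(head, relation, tail)])
print(descriptions)
```

Entities without the property are simply skipped, which matters here because the LLM is not guaranteed to fill in a description for every node.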

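Results Analysis

The knowledge graph produced by DynamicLLMPathExtractor balances diversity against consistency. It can capture important relationships that a strictly schema-based approach might miss.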
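One rough way to put numbers on the diversity and consistency trade-off is to compare how many distinct relation labels each graph uses relative to its total number of relations. The sketch below is one possible measure, not something the library provides; label_diversity is a hypothetical helper, and the two label lists are invented purely to show the calculation.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def label_diversity(labels: Iterable[str]) -> Dict[str, Any]:
    """Summarize how varied a collection of relation (or entity) labels is."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        "total": total,
        "distinct": len(counts),
        # 1.0 means every label is unique; values near 0 mean heavy reuse.
        "ratio": len(counts) / total if total else 0.0,
        "top": counts.most_common(3),
    }


# Invented label lists for illustration only, not measured results.
free_labels = ["BORN_IN", "WAS_BORN_IN", "MEMBER_OF", "BELONGS_TO"]
seeded_labels = ["MEMBER_OF", "MEMBER_OF", "PRESIDENT_OF", "BORN_IN"]

free_report = label_diversity(free_labels)
seeded_report = label_diversity(seeded_labels)
print(free_report)
print(seeded_report)
```

A lower ratio suggests that labels are being reused consistently, while a ratio close to 1.0 suggests near-duplicate labels such as BORN_IN and WAS_BORN_IN that may be worth merging.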

Complete Code Example

from llama_index.core import Document, PropertyGraphIndex
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

import wikipedia
import nest_asyncio

nest_asyncio.apply()

# Create the LLM
llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")

# Apply global settings
Settings.llm = llm
Settings.chunk_size = 2048
Settings.chunk_overlap = 20

# Fetch raw text from Wikipedia
def get_wikipedia_content(title):
    try:
        page = wikipedia.page(title)
        return page.content
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Disambiguation page. Options: {e.options}")
    except wikipedia.exceptions.PageError:
        print(f"Page '{title}' does not exist.")
    return None

wiki_title = "Barack Obama"
content = get_wikipedia_content(wiki_title)

if content:
    document = Document(text=content, metadata={"title": wiki_title})
    print(
        f"Fetched content for '{wiki_title}' (length: {len(content)} characters)"
    )
else:
    # Stop here: the code below needs `document`, which is only defined on success.
    raise RuntimeError("Failed to fetch Wikipedia content.")

# Without an initial schema
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=None,
    allowed_relation_types=None,
    allowed_relation_props=[],
    allowed_entity_props=[],
)

dynamic_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

dynamic_index.property_graph_store.save_networkx_graph(
    name="./DynamicGraph.html"
)

dynamic_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]

# With an initial schema
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=["POLITICIAN", "POLITICAL_PARTY"],
    allowed_relation_types=["PRESIDENT_OF", "MEMBER_OF"],
    allowed_relation_props=["description"],
    allowed_entity_props=["description"],
)

dynamic_index_2 = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)

dynamic_index_2.property_graph_store.save_networkx_graph(
    name="./DynamicGraph_2.html"
)

dynamic_index_2.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]

Summary

With LlamaIndex's DynamicLLMPathExtractor, you can balance diversity and consistency and build knowledge graphs that are both flexible and structured. Hopefully this post helps you understand and apply the technique.