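DynamicLLMPathExtractor in Llama Index: A Flexible Approach to Building Knowledge Graphs
In modern data science and artificial intelligence, the knowledge graph has become an important tool for handling complex information. A knowledge graph represents entities and their relationships in a structured form, which makes information retrieval and understanding more efficient. This article takes a close look at DynamicLLMPathExtractor in Llama Index to help programmers understand how it works and how to apply it.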
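To make the idea of entities and relationships concrete, here is a minimal sketch in plain Python that stores a graph as (subject, relation, object) triples and looks up everything attached to one entity. The toy triples and the helper name triples_about are made up for illustration and are not part of Llama Index.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hand-written toy triples, used only to illustrate the data shape.
TOY_GRAPH: List[Triple] = [
    ("Ada Lovelace", "WROTE_ABOUT", "Analytical Engine"),
    ("Charles Babbage", "DESIGNED", "Analytical Engine"),
    ("Ada Lovelace", "COLLABORATED_WITH", "Charles Babbage"),
]


def triples_about(graph: List[Triple], entity: str) -> List[Triple]:
    """Return every triple in which the entity is the subject or the object."""
    return [t for t in graph if entity in (t[0], t[2])]


if __name__ == "__main__":
    for subject, relation, obj in triples_about(TOY_GRAPH, "Ada Lovelace"):
        print(f"{subject} -[{relation}]-> {obj}")
```

A property graph, as used later in this post, follows the same pattern but attaches a type label and optional properties to each node and relation.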
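Prerequisites
Before you start, make sure you have the following:
- Python basics: familiarity with Python programming.
- OpenAI API key: you need an OpenAI API key to use the OpenAI models.
- Llama Index: install the library with pip install llama-index.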
Environment Setup
First, set up the environment by installing the required packages and configuring the OpenAI API key.
# Install Llama Index and the helper packages (notebook magic)
%pip install llama_index pyvis wikipedia
# Set the OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
# Configure logging
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# Import the required libraries
from llama_index.core import Document, PropertyGraphIndex
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
import wikipedia
# Allow nested event loops so async extraction works inside a notebook
import nest_asyncio
nest_asyncio.apply()
Setting Up the LLM Backend
# Set up the LLM
llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")
# Apply the global settings
Settings.llm = llm
Settings.chunk_size = 2048
Settings.chunk_overlap = 20
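Settings.chunk_size and Settings.chunk_overlap control how the document is split before extraction, and each chunk is sent to the LLM separately. The sketch below shows the sliding-window idea on a plain list of tokens. It is a simplification: as far as I know, the default Llama Index splitter also tries to respect sentence boundaries, and the function name chunk_with_overlap is hypothetical.

```python
from typing import List, Sequence


def chunk_with_overlap(
    tokens: Sequence[str], chunk_size: int, overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap(list("abcdefghij"), chunk_size=4, overlap=1):
        print("".join(chunk))
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of processing a few tokens twice.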
Fetching Raw Text from Wikipedia
def get_wikipedia_content(title):
    """Return the page text, or None if the page is ambiguous or missing."""
    try:
        page = wikipedia.page(title)
        return page.content
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Disambiguation page. Options: {e.options}")
    except wikipedia.exceptions.PageError:
        print(f"Page '{title}' does not exist.")
    return None

wiki_title = "Barack Obama"
content = get_wikipedia_content(wiki_title)
if content:
    document = Document(text=content, metadata={"title": wiki_title})
    print(
        f"Fetched content for '{wiki_title}' (length: {len(content)} characters)"
    )
else:
    print("Failed to fetch Wikipedia content.")
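DynamicLLMPathExtractor
DynamicLLMPathExtractor is a flexible path extractor that combines the freedom of SimpleLLMPathExtractor with some of the initial schema guidance of SchemaLLMPathExtractor. It can extend the set of entity and relation types as needed while keeping the labels reasonably consistent.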
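The difference between a strict schema and a dynamic one can be sketched without an LLM. In the toy class below, seed types act as suggestions: a label that matches a seed type is reused, and an unseen label is accepted and remembered so that later occurrences stay consistent. This only illustrates the idea and is not how DynamicLLMPathExtractor is implemented; DynamicTypeRegistry and its methods are hypothetical names.

```python
from typing import Iterable, List


class DynamicTypeRegistry:
    """Toy registry: seed types guide labelling, new types are allowed."""

    def __init__(self, seed_types: Iterable[str] = ()) -> None:
        self.seed_types: List[str] = [self._normalize(t) for t in seed_types]
        self.known_types: List[str] = list(self.seed_types)

    @staticmethod
    def _normalize(label: str) -> str:
        # Map "political party" and "Political_Party" to the same key.
        return "_".join(label.strip().upper().replace("_", " ").split())

    def resolve(self, proposed: str) -> str:
        """Return the canonical label, registering it if it is new."""
        label = self._normalize(proposed)
        if label not in self.known_types:
            self.known_types.append(label)
        return label

    def added_types(self) -> List[str]:
        """Labels that were not part of the seed schema."""
        return [t for t in self.known_types if t not in self.seed_types]


if __name__ == "__main__":
    registry = DynamicTypeRegistry(["POLITICIAN", "POLITICAL_PARTY"])
    print(registry.resolve("political party"))
    print(registry.resolve("University"))
    print(registry.added_types())
```

A strict schema would reject the unseen label instead of registering it, and that is the trade-off the two sections below explore.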
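No Initial Schema
Without an initial schema, DynamicLLMPathExtractor leaves the LLM free to infer the schema and to label entities and relations according to its best judgment.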
# No seed types: the LLM decides every entity and relation label
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=None,
    allowed_relation_types=None,
    allowed_relation_props=[],
    allowed_entity_props=[],
)
# Build the property graph index from the fetched document
dynamic_index = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)
# Save an interactive HTML view of the graph
dynamic_index.property_graph_store.save_networkx_graph(
    name="./DynamicGraph.html"
)
# Show the first five triplets that involve the named entities
dynamic_index.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
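The raw triplets can be hard to read in a notebook cell. The sketch below formats them one per line, assuming each triplet is a (source, relation, target) tuple whose nodes expose name and label attributes and whose relation exposes label. That matches my understanding of the property graph store, but treat it as an assumption. The namedtuple stand-ins Node and Edge and the helper names are hypothetical and exist only so the sketch runs without an API call.

```python
from collections import namedtuple
from typing import Any, Iterable, List, Tuple

# Stand-ins for the graph objects; the attribute names are assumptions.
Node = namedtuple("Node", ["name", "label"])
Edge = namedtuple("Edge", ["label"])


def describe(item: Any) -> str:
    """Render a node as 'name (label)', falling back to str(item)."""
    name = getattr(item, "name", None)
    label = getattr(item, "label", None)
    if name is None:
        return str(item)
    return f"{name} ({label})" if label is not None else str(name)


def format_triplets(triplets: Iterable[Tuple[Any, Any, Any]]) -> List[str]:
    """Turn (source, relation, target) tuples into readable lines."""
    return [
        f"{describe(src)} -[{getattr(rel, 'label', rel)}]-> {describe(dst)}"
        for src, rel, dst in triplets
    ]


if __name__ == "__main__":
    sample = [(Node("A", "PERSON"), Edge("KNOWS"), Node("B", "PERSON"))]
    for line in format_triplets(sample):
        print(line)
```

With the index built above, the same helper could be applied to the list returned by get_triplets, provided the returned objects carry those attributes.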
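With an Initial Schema
With an initial schema, DynamicLLMPathExtractor is given some starting entity and relation types to guide the LLM's labeling decisions. This does not guarantee that the LLM will use these types, but it gives the model a starting point that it can extend as needed.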
# Seed types guide the LLM, which may still add labels of its own
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=["POLITICIAN", "POLITICAL_PARTY"],
    allowed_relation_types=["PRESIDENT_OF", "MEMBER_OF"],
    allowed_relation_props=["description"],
    allowed_entity_props=["description"],
)
# Build a second index so the two graphs can be compared
dynamic_index_2 = PropertyGraphIndex.from_documents(
    [document],
    llm=llm,
    embed_kg_nodes=False,
    kg_extractors=[kg_extractor],
    show_progress=True,
)
# Save an interactive HTML view of the second graph
dynamic_index_2.property_graph_store.save_networkx_graph(
    name="./DynamicGraph_2.html"
)
# Show the first five triplets that involve the named entities
dynamic_index_2.property_graph_store.get_triplets(
    entity_names=["Barack Obama", "Obama"]
)[:5]
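Results Analysis
The knowledge graph produced by DynamicLLMPathExtractor strikes a balance between diversity and consistency. It can capture important relationships that a strictly schema-based approach might miss.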
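One rough way to check this balance is to count how many extracted relation labels come from the seed list and how many the model introduced on its own. The helper below is a sketch under that assumption. schema_coverage is a hypothetical name, and the labels in the usage lines are placeholders rather than output from the runs above.

```python
from collections import Counter
from typing import Dict, Iterable


def schema_coverage(
    labels: Iterable[str], seed_labels: Iterable[str]
) -> Dict[str, object]:
    """Summarize how far extracted labels stay within the seed schema."""
    counts = Counter(labels)
    seed = set(seed_labels)
    total = sum(counts.values())
    in_seed = sum(n for label, n in counts.items() if label in seed)
    return {
        "total": total,  # number of extracted relations
        "distinct": len(counts),  # number of different labels used
        "seed_share": in_seed / total if total else 0.0,
        "new_labels": sorted(label for label in counts if label not in seed),
    }


if __name__ == "__main__":
    # Placeholder labels, not real extraction output.
    placeholder = ["MEMBER_OF", "MEMBER_OF", "BORN_IN", "PRESIDENT_OF"]
    print(schema_coverage(placeholder, ["PRESIDENT_OF", "MEMBER_OF"]))
```

A high seed share with a short list of new labels suggests the seed schema is guiding the model, while a long list of near-duplicate new labels suggests the seed types need to be extended.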
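Summary
With Llama Index's DynamicLLMPathExtractor, we can balance diversity and consistency and build knowledge graphs that are both flexible and structured. Seed types keep the labels consistent, and the LLM remains free to add new types where the text calls for them. I hope this post helps you understand and apply knowledge graph techniques.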