Introduction
This article walks through installing and configuring GraphRAG, covering both pip and source installation. The source-installation section demonstrates cloning the repository, setting up a Python environment, and installing dependencies with Poetry. It then shows how to use Volcengine's Doubao models to build an index and run local and global searches, how to install Neo4j via Docker to visualize the knowledge graph, and finally how to customize prompts for a specific domain so the pipeline better fits different applications.
Installation
Install via pip
pip install graphrag
Installing the package directly via pip makes it inconvenient to modify the source code, so this article uses a source installation instead; everything below assumes the source-installation setup.
Install from source
Clone the repository:
git clone https://github.com/microsoft/graphrag.git
Set up a Python environment (skip if you already have one; pyenv is used here to manage environments):
# pyenv install command for macOS; look up the equivalent for other systems
brew install pyenv
# configure environment variables
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
# Python 3.12 failed to build the Poetry environment in my tests; 3.10-3.11 is recommended
pyenv install 3.11.9
pyenv global 3.11.9
Install Poetry
# Poetry install command for macOS; look up the equivalent for other systems
brew install poetry
# enter the repository
cd graphrag
# if you use pyenv, point Poetry at the current pyenv Python version
poetry env use $(pyenv which python)
# install dependencies
poetry install
poetry shell
Create an index
Create a working folder:
mkdir Q
Initialize folder Q:
poetry run poe index --init --root Q
# command for non-source installs: python -m graphrag.index --init --root Q
Folder Q after initialization:
Q
├── .env
├── output
│   └── 20240726-142724
│       └── reports
│           └── indexing-engine.log
├── prompts
│   ├── claim_extraction.txt
│   ├── community_report.txt
│   ├── entity_extraction.txt
│   └── summarize_descriptions.txt
└── settings.yaml
Next create cache and input folders, and place the files to be indexed into input; files in input must be in txt format.
Q
├── .env
├── cache
├── input
│   └── 阿Q正传.txt
├── output
│   └── 20240726-142724
│       └── reports
│           └── indexing-engine.log
├── prompts
│   ├── claim_extraction.txt
│   ├── community_report.txt
│   ├── entity_extraction.txt
│   └── summarize_descriptions.txt
└── settings.yaml
Next, set the API key in .env and edit the configuration file settings.yaml.
I use Volcengine's Doubao model family; both its chat models and its embedding model expose OpenAI-compatible interfaces.
In settings.yaml you only need to change api_base and model under llm and embeddings; adjust tokens_per_minute and requests_per_minute to match your actual quota.
Reference settings.yaml:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: <doubao_model_id>
  model_supports_json: false # recommended if this is available for your model.
  max_tokens: 4000
  request_timeout: 180.0
  api_base: https://ark.cn-beijing.volces.com/api/v3/
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 800_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate
parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing
async_mode: threaded # or asyncio
embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: <doubao_model_id>
    api_base: https://ark.cn-beijing.volces.com/api/v3/
    encoding_format: float
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 1 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"
cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500
claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1
community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000
cluster_graph:
  max_cluster_size: 10
embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832
umap:
  enabled: false # if true, will generate UMAP embeddings for nodes
snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false
local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
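The chunks settings above split each input document into overlapping windows before entity extraction. The real pipeline counts tiktoken tokens, but the windowing itself can be sketched with a hypothetical helper (a simplification, not graphrag code):

```python
def chunk_tokens(tokens: list, size: int, overlap: int) -> list[list]:
    """Split a token sequence into windows of `size` tokens; each window
    overlaps the previous one by `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# with size=4 and overlap=1, consecutive windows share one token:
print(chunk_tokens(list(range(10)), 4, 1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

With the defaults above (size 1200, overlap 100), consecutive chunks share 100 tokens, which helps entities that fall near chunk boundaries survive extraction.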
Modify the code
graphrag/llm/openai/openai_embeddings_llm.py
In _execute_llm, add "encoding_format": "float" to the args dict; the Doubao API reports an error if this parameter is missing.
async def _execute_llm(
    self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
) -> EmbeddingOutput | None:
    args = {
        "model": self.configuration.model,
        "encoding_format": "float",
        **(kwargs.get("model_parameters") or {}),
    }
    embedding = await self.client.embeddings.create(
        input=input,
        **args,
    )
    return [d.embedding for d in embedding.data]
Build the index
poetry run poe index --root Q
# command for non-source installs: python -m graphrag.index --root Q
This step can be slow, and for large documents token consumption is high, so keep an eye on your quota.
When the run finishes, "All workflows completed successfully." indicates the index was built successfully.
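After a successful run, the artifacts are written to a timestamped folder under output/. A small helper along these lines (hypothetical, not part of graphrag) locates the most recent run and lists its parquet files, which is handy to check before working with the output:

```python
from pathlib import Path

def latest_artifacts(root: str) -> list[str]:
    """Return parquet artifact names from the newest timestamped run
    under <root>/output."""
    runs = sorted(p for p in (Path(root) / "output").iterdir() if p.is_dir())
    if not runs:
        return []
    return sorted(f.name for f in (runs[-1] / "artifacts").glob("*.parquet"))

# e.g. latest_artifacts("Q") should list files such as
# create_final_entities.parquet and create_final_relationships.parquet
```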
Querying
Local Search
Use cases
- Specific-entity queries: local search works well when a question requires understanding particular entities mentioned in the input documents, e.g. "What are the healing properties of chamomile?", which calls for detailed knowledge of one entity and its related information.
- Low-latency answers: by combining structured data from the knowledge graph with unstructured data from the input documents, local search quickly assembles the context relevant to a query and produces a timely, accurate answer.
- Fine-grained retrieval: for questions that require extracting and relating specific text fragments from the source documents, local search identifies semantically related entities and uses them as entry points to fetch further details.
Because Doubao's embedding API is used, local search also requires modifying _embed_with_retry in graphrag/query/llm/oai/embedding.py (around line 121):
def _embed_with_retry(
    self, text: str | tuple, **kwargs: Any
) -> tuple[list[float], int]:
    try:
        retryer = Retrying(
            stop=stop_after_attempt(self.max_retries),
            wait=wait_exponential_jitter(max=10),
            reraise=True,
            retry=retry_if_exception_type(self.retry_error_types),
        )
        for attempt in retryer:
            if isinstance(text, tuple):
                text = [str(i) for i in text]
            with attempt:
                embedding = (
                    self.sync_client.embeddings.create(  # type: ignore
                        input=text,
                        model=self.model,
                        encoding_format="float",
                        **kwargs,  # type: ignore
                    )
                    .data[0]
                    .embedding
                    or []
                )
                return (embedding, len(text))
    except RetryError as e:
        self._reporter.error(
            message="Error at embed_with_retry()",
            details={self.__class__.__name__: str(e)},
        )
        return ([], 0)
    else:
        # TODO: why not just throw in this case?
        return ([], 0)
poetry run poe query --root Q --method local "阿Q的主要经历有哪些"
# command for non-source installs: python -m graphrag.query --root Q --method local '阿Q的主要经历有哪些'
SUCCESS: Local Search Response: **1. The entanglement with Master Zhao**
Ah Q claims to be of the same clan as Master Zhao, yet is scolded and beaten by him; Master Zhao's changing attitude toward Ah Q reflects his status and authority. [Data: Relationships (0)]
**2. The conflict with the Fake Foreign Devil**
The Fake Foreign Devil restricts Ah Q's movements and beats him, and Ah Q harbors resentment toward him. [Data: Relationships (3)]
**3. The uproar over courting Amah Wu**
Ah Q abruptly propositions Amah Wu, provoking a strong reaction from her; the incident causes quite a stir in Weizhuang. [Data: Relationships (2)]
**4. Experiences at the Zhao household**
Ah Q husks rice at the Zhao household and witnesses it being robbed, after which his view of the family shifts. [Data: Relationships (8)]
**5. Thoughts on revolution and his fate**
Ah Q has ideas about the revolution and declares he will take part, but his fate is shaped by the decisions of those around him and his revolutionary wishes are never realized. [Data: Relationships (22)]
Global Search
Use cases
- Theme-aggregation queries: global search suits questions that aggregate information across the whole dataset, e.g. "What are the top five themes in the data?", which require analyzing and summarizing the entire corpus rather than relying on one or a few documents.
- Data organization and pre-summarization: global search exploits the structure of the LLM-generated knowledge graph to organize the private dataset into meaningful semantic clusters; because these clusters are summarized in advance, overview-style answers can be served efficiently.
- Complex queries: for questions that span multiple data sources, relationships, and themes, global search provides more comprehensive answers, since it understands and responds based on the overall structure of the knowledge graph.
Run a global query:
poetry run poe query --root Q --method global "这篇文章主要揭示了什么"
# command for non-source installs: python -m graphrag.query --root Q --method global '这篇文章主要揭示了什么'
SUCCESS: Global Search Response: **1. Character relationships**
The story mainly reveals the tangled relationships among the people of Weizhuang, such as Ah Q's interactions with Master Zhao, the Fake Foreign Devil, and Amah Wu [Data: (5, 6, 1, +more)].
**2. Influence of place**
It shows how Weizhuang, as the story's central setting, shapes the characters' fates and behavior through its environment and atmosphere [Data: (6, +more)].
**3. Participation in the revolution**
It touches on the roles certain characters play in the revolution, such as the Provincial Scholar and the Successful Candidate [Data: (1, +more)].
Summary
- Local search is better suited to questions about specific entities and details, quickly extracting the required information from the relevant documents.
- Global search excels at queries that need information integrated and summarized across the dataset, using pre-built semantic clusters to give a broader, more synthesized answer.
Knowledge graph visualization
Install Neo4j with Docker:
docker run \
-p 7474:7474 -p 7687:7687 \
--name neo4j-apoc \
-e NEO4J_apoc_export_file_enabled=true \
-e NEO4J_apoc_import_file_enabled=true \
-e NEO4J_apoc_import_file_use__neo4j__config=true \
-e NEO4J_PLUGINS=\[\"apoc\"\] \
neo4j:latest
Open http://localhost:7474/browser/ and change the password.
Initial username: neo4j
Initial password: neo4j
Adjust the database parameters and the output path, then run the following code:
import pandas as pd
from neo4j import GraphDatabase
import time
NEO4J_URI = "neo4j://localhost" # or neo4j+s://xxxx.databases.neo4j.io
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "12345678"
NEO4J_DATABASE = "neo4j"
GRAPHRAG_FOLDER = "./output/20240724-151213/artifacts"
# Create a Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
statements = """
create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique;
create constraint document_id if not exists for (d:__Document__) require d.id is unique;
create constraint community_id if not exists for (c:__Community__) require c.community is unique;
create constraint entity_id if not exists for (e:__Entity__) require e.id is unique;
create constraint entity_title if not exists for (e:__Entity__) require e.name is unique;
create constraint covariate_title if not exists for (e:__Covariate__) require e.title is unique;
create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique;
""".split(";")
for statement in statements:
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)
def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.

    Parameters: statement is the Cypher query to execute, df is the dataframe
    to import, and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start: min(start + batch_size, total)]
        result = driver.execute_query("UNWIND $rows AS value " + statement,
                                      rows=batch.to_dict('records'),
                                      database_=NEO4J_DATABASE)
        print(result.summary.counters)
    print(f'{total} rows in {time.time() - start_s} s.')
    return total
doc_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_documents.parquet', columns=["id", "title"])
doc_df.head(2)
# import documents
statement = """
MERGE (d:__Document__ {id:value.id})
SET d += value {.title}
"""
batched_import(statement, doc_df)
text_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_text_units.parquet',
                          columns=["id", "text", "n_tokens", "document_ids"])
text_df.head(2)
statement = """
MERGE (c:__Chunk__ {id:value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id:document})
MERGE (c)-[:PART_OF]->(d)
"""
batched_import(statement, text_df)
entity_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_entities.parquet',
                            columns=["name", "type", "description", "human_readable_id", "id",
                                     "description_embedding", "text_unit_ids"])
entity_df.head(2)
entity_statement = """
MERGE (e:__Entity__ {id:value.id})
SET e += value {.human_readable_id, .description, name:replace(value.name,'"','')}
WITH e, value
CALL db.create.setNodeVectorProperty(e, "description_embedding", value.description_embedding)
CALL apoc.create.addLabels(e, case when coalesce(value.type,"") = "" then [] else [apoc.text.upperCamelCase(replace(value.type,'"',''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id:text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""
batched_import(entity_statement, entity_df)
rel_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_relationships.parquet',
                         columns=["source", "target", "id", "rank", "weight", "human_readable_id",
                                  "description", "text_unit_ids"])
rel_df.head(2)
rel_statement = """
MATCH (source:__Entity__ {name:replace(value.source,'"','')})
MATCH (target:__Entity__ {name:replace(value.target,'"','')})
// not necessary to merge on id as there is only one relationship per pair
MERGE (source)-[rel:RELATED {id: value.id}]->(target)
SET rel += value {.rank, .weight, .human_readable_id, .description, .text_unit_ids}
RETURN count(*) as createdRels
"""
batched_import(rel_statement, rel_df)
community_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_communities.parquet',
                               columns=["id", "level", "title", "text_unit_ids", "relationship_ids"])
community_df.head(2)
statement = """
MERGE (c:__Community__ {community:value.id})
SET c += value {.level, .title}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id:text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURN count(distinct c) as createdCommunities
"""
batched_import(statement, community_df)
community_report_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_community_reports.parquet',
                                      columns=["id", "community", "level", "title", "summary", "findings",
                                               "rank", "rank_explanation", "full_content"])
community_report_df.head(2)
# import community reports
community_statement = """MATCH (c:__Community__ {community: value.community})
SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id: finding_idx})
SET f += finding"""
batched_import(community_statement, community_report_df)
Once the import succeeds, go back to http://localhost:7474/browser/ to explore the graph.
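As a starting point in the browser, a query along these lines (a sketch; adjust the labels and LIMIT to taste) pulls a slice of entities, their relationships, and the chunks they were extracted from:

```cypher
// a sample slice of the imported graph
MATCH (c:__Chunk__)-[:HAS_ENTITY]->(e:__Entity__)-[r:RELATED]->(e2:__Entity__)
RETURN c, e, r, e2
LIMIT 50
```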
Custom prompts
GraphRAG's built-in prompt tuning can generate prompts for a target domain:
poetry run poe prompt_tune --root aQ --domain "a software engineering code" --method random --limit 2 --chunk-size 500 --output prompt-project
# command for non-source installs: python -m graphrag.prompt_tune --root aQ --domain 'a software engineering code' --method random --limit 2 --chunk-size 500 --output prompt-project
- root - path containing the config yaml and the input files
- domain - the domain to adapt to
- method - how to select documents as tuning references; one of all, random, or top
- limit - number of files to load when method is random or top
- max-tokens - maximum number of tokens for the generated prompts
- chunk-size - chunk size
- language - language to adapt to
- no-entity-type - use untyped entity extraction
- output - where to write the generated prompts; otherwise the default prompts are overwritten in place
Three prompt files are generated:
- community_report.txt
- entity_extraction.txt
- summarize_descriptions.txt
Update settings.yaml to point at the generated prompts, and adjust the entity types to extract.
Possibly because of the model I used, the generated prompts did not match what I wanted very well, so I hand-tuned the default prompts instead.
The main file to edit is entity_extraction.txt: rewrite -Goal- for your own domain, change entity_type in -Steps-, then feed the -Steps- prompt and the examples from the default file to GPT to generate domain-appropriate examples and add them in.
The other two prompts can be reworked the same way with GPT.