Getting Started with GraphRAG: A Guide to Installation, Indexing, and Querying

Introduction

This article walks through installing and configuring GraphRAG, covering both pip and source installation. The source-install section shows how to clone the repository, set up a Python environment, and install dependencies with poetry. It then explains how to use the Doubao models to build an index and run local and global searches, and how to install Neo4j with Docker to visualize the resulting knowledge graph. Finally, it covers customizing prompts for a specific domain so the pipeline adapts better to different applications.

Installation

Installing directly with pip

pip install graphrag

Because installing the package directly with pip makes it inconvenient to modify the source code, this guide installs from source instead; everything below assumes a source install.

Installing from source

Clone the source code:

git clone https://github.com/microsoft/graphrag.git

Set up a Python environment (skip if you already have one; pyenv is used here to manage versions):

# pyenv install command for macOS; for other systems, consult the pyenv docs
brew install pyenv  

# Configure environment variables
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

# 3.12 was tried but the poetry environment failed to build; 3.10-3.11 is recommended
pyenv install 3.11.9  
pyenv global 3.11.9

Install poetry

# poetry install command for macOS; for other systems, consult the poetry docs
brew install poetry

# If you use pyenv, you can point poetry at the current pyenv Python version
poetry env use $(pyenv which python)

# Enter the repository
cd graphrag

# Install dependencies
poetry install
poetry shell

Creating an index

Create a working folder

mkdir Q

Initialize the folder Q

poetry run poe index --init --root Q
# For a pip install: python -m graphrag.index --init --root Q

Q after initialization:

Q
├── .env
├── output
│   └── 20240726-142724
│       └── reports
│           └── indexing-engine.log
├── prompts
│   ├── claim_extraction.txt
│   ├── community_report.txt
│   ├── entity_extraction.txt
│   └── summarize_descriptions.txt
└── settings.yaml

Next, create the cache and input folders and put the files to be indexed into input; files placed in input must be in txt format.
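For example (assuming 阿Q正传.txt sits in the current directory):

mkdir -p Q/cache Q/input
cp 阿Q正传.txt Q/input/

The folder should then look like this: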

Q
├── .env
├── cache
├── input
│   └── 阿Q正传.txt
├── output
│   └── 20240726-142724
│       └── reports
│           └── indexing-engine.log
├── prompts
│   ├── claim_extraction.txt
│   ├── community_report.txt
│   ├── entity_extraction.txt
│   └── summarize_descriptions.txt
└── settings.yaml

Next, update the key in .env and the settings.yaml configuration file.

I used the Doubao model family on Volcengine; both its chat models and its embedding models expose OpenAI-compatible APIs.

In settings.yaml, only api_base and model under llm and embeddings need to change; adjust tokens_per_minute and requests_per_minute to match your actual quota.
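The .env file created by --init only needs to hold the key that settings.yaml references as ${GRAPHRAG_API_KEY}; with a Volcengine Ark (Doubao) API key it might look like:

GRAPHRAG_API_KEY=<your Ark API key>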

Reference settings.yaml

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: <Doubao model ID>
  model_supports_json: false # recommended if this is available for your model.
  max_tokens: 4000
  request_timeout: 180.0
  api_base: https://ark.cn-beijing.volces.com/api/v3/
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 800_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: <Doubao model ID>
    api_base: https://ark.cn-beijing.volces.com/api/v3/
    encoding_format: float
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 1 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  



chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Modifying the code

graphrag/llm/openai/openai_embeddings_llm.py

In _execute_llm, add "encoding_format": "float" to the args dict; Doubao's embedding API returns an error if this parameter is missing.

async def _execute_llm(
    self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
) -> EmbeddingOutput | None:
    args = {
        "model": self.configuration.model,
        "encoding_format": "float",
        **(kwargs.get("model_parameters") or {}),
    }
    embedding = await self.client.embeddings.create(
        input=input,
        **args,
    )
    return [d.embedding for d in embedding.data]

Build the index

poetry run poe index --root Q
# For a pip install: python -m graphrag.index --root Q

The process can be slow, and large documents consume a lot of tokens, so keep an eye on your quota.
When "All workflows completed successfully." appears, the index has been built successfully.
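The artifacts are written as parquet files under output/<timestamp>/artifacts inside the root folder (the same files the Neo4j import script below reads). A minimal sketch for sanity-checking them, assuming pandas is available and substituting the timestamp of your own run:

import pandas as pd

# Example path; replace the timestamp with the one from your own run
artifacts = "Q/output/20240726-142724/artifacts"

entities = pd.read_parquet(f"{artifacts}/create_final_entities.parquet")
relationships = pd.read_parquet(f"{artifacts}/create_final_relationships.parquet")
print(entities[["name", "type"]].head())
print(f"{len(entities)} entities, {len(relationships)} relationships")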

Querying

Local Search

Use cases

  1. Queries about specific entities: local search works well when the question requires understanding specific entities mentioned in the input documents, e.g. "What are the healing properties of chamomile?" Such questions need detailed knowledge of a particular entity and its related information.
  2. Prompt, grounded answers: by combining structured data from the knowledge graph with unstructured data from the input documents, local search quickly assembles context relevant to the query and produces timely, accurate answers.
  3. Fine-grained retrieval: for questions that require extracting and linking specific text fragments from the source documents, local search identifies semantically related entities and uses them as entry points to pull in further relevant detail.

Because Doubao is used for embeddings, local search also requires modifying _embed_with_retry in graphrag/query/llm/oai/embedding.py (around line 121):

   def _embed_with_retry(
       self, text: str | tuple, **kwargs: Any
   ) -> tuple[list[float], int]:
       try:
           retryer = Retrying(
               stop=stop_after_attempt(self.max_retries),
               wait=wait_exponential_jitter(max=10),
               reraise=True,
               retry=retry_if_exception_type(self.retry_error_types),
           )
           for attempt in retryer:
               if isinstance(text, tuple):
                   text = [str(i) for i in text]
               with attempt:
                   embedding = (
                       self.sync_client.embeddings.create(  # type: ignore
                           input=text,
                           model=self.model,
                           encoding_format="float",
                           **kwargs,  # type: ignore
                       )
                       .data[0]
                       .embedding
                       or []
                   )
                   return (embedding, len(text))
       except RetryError as e:
           self._reporter.error(
               message="Error at embed_with_retry()",
               details={self.__class__.__name__: str(e)},
           )
           return ([], 0)
       else:
           # TODO: why not just throw in this case?
           return ([], 0)

Run a local search (the query asks "What are Ah Q's main experiences?"):

poetry run poe query --root Q --method local "阿Q的主要经历有哪些"
# For a pip install: python -m graphrag.query --root Q --method local '阿Q的主要经历有哪些'
SUCCESS: Local Search Response: **1. The entanglement with Master Zhao**

Ah Q claims to be of the same clan as Master Zhao, yet is scolded and beaten by him. Master Zhao's changing attitude toward Ah Q reflects his status and authority. [Data: Relationships (0)]

**2. The conflict with the Fake Foreign Devil**

The Fake Foreign Devil restricts Ah Q's movements and beats him, and Ah Q harbors resentment toward him. [Data: Relationships (3)]

**3. The uproar over courting Amah Wu**

Ah Q abruptly asks Amah Wu to sleep with him, provoking a strong reaction from her; the incident causes quite a stir in Weizhuang. [Data: Relationships (2)]

**4. Experiences in the Zhao household**

Ah Q hulls rice in the Zhao household and witnesses the Zhao family being robbed; his view of the family also shifts. [Data: Relationships (8)]

**5. Thoughts on revolution and his fate**

Ah Q has ideas about the revolution and claims he will take part; his fate is also shaped by the decisions of those around him, but his wish to join the revolution is never realized. [Data: Relationships (22)]

Global Search

Use cases

  1. Thematic, aggregate queries: global search suits questions that require aggregating information across the dataset, e.g. "What are the top five themes in the data?" Such questions call for analyzing and summarizing the whole dataset rather than relying on one or a few documents.
  2. Data organization and pre-summarization: global search uses the structure of the LLM-generated knowledge graph to organize the private dataset into meaningful semantic clusters. These clusters are summarized in advance, so overview-level information can be served efficiently at query time.
  3. Complex query handling: for queries that need to synthesize multiple sources, relationships, and themes, global search can give a more comprehensive answer because it reasons over the overall structure of the knowledge graph.
Run a global search (the query asks "What does this story mainly reveal?"):

poetry run poe query --root Q --method global "这篇文章主要揭示了什么"
# For a pip install: python -m graphrag.query --root Q --method global '这篇文章主要揭示了什么'
SUCCESS: Global Search Response: **1. Character relationships**
The story mainly reveals the complex relationships among the people of Weizhuang, such as Ah Q's interactions with Master Zhao, the Fake Foreign Devil, and Amah Wu [Data: (5, 6, 1, +more)].

**2. The influence of place**
It shows how Weizhuang, as the central setting of the story, shapes the characters' fates and behavior through its environment and atmosphere [Data: (6, +more)].

**3. Participation in the revolution**
It touches on the roles and involvement of certain characters in the revolution, such as the successful provincial candidate and the young scholar [Data: (1, +more)].

Summary

  • Local search is better suited to questions about specific entities and details; it can quickly extract the needed information from the relevant documents.
  • Global search excels at queries that require aggregating and summarizing information across the dataset, drawing on the pre-summarized semantic clusters to provide a broader, more comprehensive answer.

Knowledge graph visualization

Install Neo4j with Docker

docker run \
    -p 7474:7474 -p 7687:7687 \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4J_PLUGINS=\[\"apoc\"\] \
    neo4j:latest

Visit http://localhost:7474/browser/ and change the password.

Initial username: neo4j

Initial password: neo4j

Update the database parameters and the output path, then run the following code:

import pandas as pd
from neo4j import GraphDatabase
import time



NEO4J_URI = "neo4j://localhost"  # or neo4j+s://xxxx.databases.neo4j.io
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "12345678"
NEO4J_DATABASE = "neo4j"
GRAPHRAG_FOLDER = "./output/20240724-151213/artifacts"
# Create a Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
statements = """
create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique;
create constraint document_id if not exists for (d:__Document__) require d.id is unique;
create constraint community_id if not exists for (c:__Community__) require c.community is unique;
create constraint entity_id if not exists for (e:__Entity__) require e.id is unique;
create constraint entity_title if not exists for (e:__Entity__) require e.name is unique;
create constraint covariate_title if not exists for (e:__Covariate__) require e.title is unique;
create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique;
""".split(";")
for statement in statements:
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)

def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.
    Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start: min(start + batch_size, total)]
        result = driver.execute_query("UNWIND $rows AS value " + statement,
                                      rows=batch.to_dict('records'),
                                      database_=NEO4J_DATABASE)
        print(result.summary.counters)
    print(f'{total} rows in {time.time() - start_s} s.')
    return total

doc_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_documents.parquet', columns=["id", "title"])
doc_df.head(2)
# import documents
statement = """
MERGE (d:__Document__ {id:value.id})
SET d += value {.title}
"""
batched_import(statement, doc_df)
text_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_text_units.parquet',
                          columns=["id", "text", "n_tokens", "document_ids"])
text_df.head(2)
statement = """
MERGE (c:__Chunk__ {id:value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id:document})
MERGE (c)-[:PART_OF]->(d)
"""
batched_import(statement, text_df)
entity_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_entities.parquet',
                            columns=["name", "type", "description", "human_readable_id", "id", "description_embedding",
                                     "text_unit_ids"])
entity_df.head(2)
entity_statement = """
MERGE (e:__Entity__ {id:value.id})
SET e += value {.human_readable_id, .description, name:replace(value.name,'"','')}
WITH e, value
CALL db.create.setNodeVectorProperty(e, "description_embedding", value.description_embedding)
CALL apoc.create.addLabels(e, case when coalesce(value.type,"") = "" then [] else [apoc.text.upperCamelCase(replace(value.type,'"',''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id:text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""
batched_import(entity_statement, entity_df)
rel_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_relationships.parquet',
                         columns=["source", "target", "id", "rank", "weight", "human_readable_id", "description",
                                  "text_unit_ids"])
rel_df.head(2)
rel_statement = """
    MATCH (source:__Entity__ {name:replace(value.source,'"','')})
    MATCH (target:__Entity__ {name:replace(value.target,'"','')})
    // not necessary to merge on id as there is only one relationship per pair
    MERGE (source)-[rel:RELATED {id: value.id}]->(target)
    SET rel += value {.rank, .weight, .human_readable_id, .description, .text_unit_ids}
    RETURN count(*) as createdRels
"""
batched_import(rel_statement, rel_df)
community_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_communities.parquet',
                               columns=["id", "level", "title", "text_unit_ids", "relationship_ids"])
community_df.head(2)
statement = """
MERGE (c:__Community__ {community:value.id})
SET c += value {.level, .title}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id:text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURN count(distinct c) as createdCommunities
"""
batched_import(statement, community_df)
community_report_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_community_reports.parquet',
                                      columns=["id", "community", "level", "title", "summary", "findings", "rank",
                                               "rank_explanation", "full_content"])
community_report_df.head(2)
# import community reports
community_statement = """MATCH (c:__Community__ {community: value.community})
SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id: finding_idx})
SET f += finding"""
batched_import(community_statement, community_report_df)



Once the import succeeds, go back to http://localhost:7474/browser/ to explore the graph.
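A couple of example Cypher queries against the labels and relationship types created by the import script above (illustrative only, not part of GraphRAG itself):

// Entities and the RELATED edges between them
MATCH (a:__Entity__)-[r:RELATED]->(b:__Entity__)
RETURN a, r, b LIMIT 50;

// Entity counts per community
MATCH (e:__Entity__)-[:IN_COMMUNITY]->(c:__Community__)
RETURN c.title AS community, count(e) AS members
ORDER BY members DESC LIMIT 10;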

Customizing prompts

GraphRAG ships with a prompt-tuning command that can generate prompts tailored to a target domain:

poetry run poe prompt_tune --root aQ --domain "a software engineering code" --method random --limit 2 --chunk-size 500 --output prompt-project
# For a pip install: python -m graphrag.prompt_tune --root aQ --domain 'a software engineering code' --method random --limit 2 --chunk-size 500 --output prompt-project

  • root - location of the configuration yaml and the input files

  • domain - the domain to adapt the prompts to

  • method - how documents are selected as adaptation references; one of all, random, or top

  • limit - number of files to load when method is random or top

  • max-tokens - maximum number of tokens for the generated prompts

  • chunk-size - chunk size to use

  • language - language to adapt to

  • no-entity-type - use untyped entity extraction

  • output - where to write the generated prompts; without it, the default prompts are overwritten directly
    Three prompt files are generated:

  • community_report.txt

  • entity_extraction.txt

  • summarize_descriptions.txt
    Then edit settings.yaml to point at the generated prompts and change the entity types to extract; an illustrative sketch follows.
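A minimal sketch of those settings.yaml changes, assuming the generated prompts live in prompt-project and the domain is software engineering (the entity types are examples, not prescribed by GraphRAG):

entity_extraction:
  prompt: "prompt-project/entity_extraction.txt"
  entity_types: [class, function, module, library]  # example types for the chosen domain
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompt-project/summarize_descriptions.txt"
  max_length: 500

community_reports:
  prompt: "prompt-project/community_report.txt"
  max_length: 2000
  max_input_length: 8000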

Possibly because of the model, the generated prompts did not quite match what I wanted, so I hand-tuned the default prompts instead.

The main file to edit is entity_extraction.txt: rewrite the -Goal- section for your own domain, then change the entity_type list in -Steps-. Feed the -Steps- prompt and the reference examples from the default file to GPT, have it generate a few examples that fit your domain, and add them in; a rough sketch follows.
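The sketch below paraphrases the structure of the edited file (it is not the verbatim default prompt); the entity types are the same illustrative ones used above:

-Goal-
Given a text document from <your domain, e.g. a software engineering codebase> and a list of
entity types, identify all entities of those types and all relationships among them.

-Steps-
1. Identify all entities. For each entity, extract:
   - entity_name: name of the entity
   - entity_type: one of the following types: [class, function, module, library]
   ...
(append a few GPT-generated, domain-specific examples in the same format as the default
examples further down the file)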

The other two prompts can be reworked with GPT in the same way.
