Microsoft's Open-Source GraphRAG Tutorial: Testing GraphRAG with Custom Data

luxinfeng666

Last modified: 2024-07-08 16:13:13

Category: AI Study Notes. Tags: microsoft, open source, RAG, GraphRAG

First published: 2024-07-08 09:30:00

Article link: https://blog.csdn.net/luxinfeng666/article/details/140253451

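Microsoft introduced the concept of GraphRAG in April of this year (2024) and open-sourced it last week. The GitHub repository is at https://github.com/microsoft/graphrag; as of this writing it has 6,900+ stars.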

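Installation

The official documentation recommends Python 3.10-3.12. When I installed with Python 3.10, project initialization failed with an error; after switching to Python 3.11, everything ran normally. My guess is that Python 3.10 is incompatible with some of Microsoft's newer SDKs, so I recommend a Python 3.11 environment. Installing GraphRAG itself is simple: the single command below is enough.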

pip install graphrag
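
Before installing, it may be worth confirming that the active interpreter falls in the range mentioned above. The following is a minimal sketch; the helper name is_supported is hypothetical and not part of GraphRAG, and the version set simply mirrors the 3.10-3.12 range quoted in this article.

```python
import sys

# Range quoted in this article (3.10-3.12); adjust if the official docs change.
SUPPORTED_MINORS = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter's major.minor is in the quoted range."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    if is_supported():
        print(f"Python {current} is within the recommended range.")
    else:
        print(f"Python {current} is outside the recommended 3.10-3.12 range.")
```

Note that passing this check does not guarantee a working install; in my case 3.10 was in range but still failed during initialization.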

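Usage

In this tutorial we use the short novel 《太白金星有点烦》 by Ma Boyong (马伯庸) as the example, to see how well Microsoft's open-source GraphRAG handles it.

Note that GraphRAG uses an LLM to extract entities and relationships from text chunks, so it consumes a lot of tokens. For personal experiments I don't recommend GPT-4-class models (too expensive; if money is no object, ignore this advice). Balancing cost and quality, I use the DeepSeek-Chat model here.

Initializing the project

I first created a temporary test directory, myTest. Following the official tutorial, I created an input directory under myTest, renamed the txt version of 《太白金星有点烦》 to book.txt, and put it in the input directory. Then I ran python -m graphrag.index --init to initialize the project and generate the configuration files.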

mkdir -p ./myTest/input
# Example only: replace the URL with the location of whatever txt file you want to test.
curl https://www.xxx.com/太白金星有点烦.txt > ./myTest/input/book.txt
cd ./myTest
python -m graphrag.index --init

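When it finishes, several new files and folders appear in the current directory (myTest): output, where intermediate results of later runs are saved; prompts, which holds the prompts used during processing; .env, the LLM API configuration file, which by default contains only GRAPHRAG_API_KEY for setting the model's API key; and settings.yaml, the overall configuration file. If you are not using an official OpenAI model and the official API, you need to edit settings.yaml so that GraphRAG runs with the configuration you specify.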
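
A quick way to confirm that initialization produced the entries listed above is a small check like the one below. This is only a sketch: missing_entries is a hypothetical helper, and the expected names are taken from the description in this article, so other GraphRAG versions may generate a different layout.

```python
from pathlib import Path

# Names taken from the description in this article; other versions may differ.
EXPECTED_ENTRIES = ["output", "prompts", ".env", "settings.yaml"]


def missing_entries(project_root: str) -> list[str]:
    """Return the expected files/folders that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing after init:", ", ".join(missing))
    else:
        print("All expected entries are present.")
```

Run it from inside the myTest directory after the init command completes.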

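Configuring the files

First set the LLM API key in the .env file. This setting applies globally, so once it is in .env you don't need to repeat it in settings.yaml. The default model in settings.yaml is gpt-4-turbo-preview. If you don't need to change the model or the API endpoint, configuration is already complete; you can skip the rest of this section and go straight to the execution step.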
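
To make the relationship between .env and settings.yaml concrete, here is a minimal sketch of how a `${GRAPHRAG_API_KEY}` placeholder could be filled in from a .env file. It is an illustration only, not GraphRAG's actual loading code: parse_env and resolve_placeholders are hypothetical helpers, and the key value is a dummy.

```python
from string import Template


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_placeholders(settings_text: str, env: dict[str, str]) -> str:
    """Fill in ${NAME} placeholders; unknown ones such as ${timestamp} stay as-is."""
    return Template(settings_text).safe_substitute(env)


# Dummy key for illustration only.
env = parse_env("GRAPHRAG_API_KEY=sk-example-not-a-real-key\n")
snippet = 'api_key: ${GRAPHRAG_API_KEY}\nbase_dir: "output/${timestamp}/artifacts"\n'
print(resolve_placeholders(snippet, env))
```

The point is simply that the key lives in one place (.env) and settings.yaml only refers to it by name.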

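I use an API key from agicto (mainly because new users get 10 RMB of free credit on sign-up, and free is nice). The only things I changed are the API address and the model name. The complete settings file after my changes is as follows: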

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://api.agicto.cn/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
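
The chunks section above sets size: 300 and overlap: 100. The sketch below shows what those two numbers mean for a sliding-window split, where each new chunk starts size minus overlap (200) tokens after the previous one. It assumes a plain sliding window over a list of token ids; chunk_tokens is a hypothetical helper, and integers stand in for real tokens, which GraphRAG would obtain with the configured encoding_model.

```python
def chunk_tokens(tokens: list, size: int = 300, overlap: int = 100) -> list[list]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = list(range(700))  # stand-in for 700 token ids
chunks = chunk_tokens(tokens)
# First token, last token, and length of each chunk.
print([(c[0], c[-1], len(c)) for c in chunks])
```

A larger overlap repeats more context between neighboring chunks, at the cost of more chunks and therefore more LLM calls and tokens.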

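Running the pipeline and building the graph index

This is GraphRAG's core step: it builds the graph-based knowledge base used for later question answering. The command below triggers it.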

python -m graphrag.index

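Following the approach described in Microsoft's paper, GraphRAG performs the following steps during execution:

Source Documents → Text Chunks: split the source documents into text chunks.
Text Chunks → Element Instances: extract instances of graph nodes and edges from each text chunk.
Element Instances → Element Summaries: generate a summary for each graph element.
Element Summaries → Graph Communities: partition the graph into communities with a community detection algorithm.
Graph Communities → Community Summaries: generate a summary for each community.
Community Summaries → Community Answers → Global Answer: use the community summaries to produce partial answers, then combine those partial answers into a global answer.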
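
To give a feel for the step that turns extracted elements into communities, here is a toy sketch. It groups nodes by connected components, which is a much simpler stand-in for a real community detection algorithm and is not what GraphRAG runs. The edge list is made-up example data, and connected_components is a hypothetical helper.

```python
from collections import defaultdict


def connected_components(edges: list[tuple[str, str]]) -> list[set[str]]:
    """Group nodes into connected components using an iterative depth-first search."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    components = []
    for node in adjacency:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(adjacency[current] - seen)
        components.append(component)
    return components


# Made-up element instances, as if extracted from text chunks.
edges = [("A", "B"), ("B", "C"), ("D", "E")]
communities = connected_components(edges)
print(sorted(sorted(c) for c in communities))
```

In the real pipeline, each resulting group would then be summarized by the LLM, and those summaries are what the global answer is built from.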

Total running time depends on the size of the text. My example took roughly 20 minutes and cost about 4 RMB. The output during execution looks like this:


🚀 Reading settings from settings.yaml
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will 
be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_base_text_un
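
Because the storage setting in settings.yaml writes results to output/${timestamp}/artifacts, each run ends up in its own folder. The sketch below locates the most recent one. It assumes the timestamp folder names sort in chronological order when compared as strings; latest_artifacts_dir is a hypothetical helper, not a GraphRAG API.

```python
from pathlib import Path
from typing import Optional


def latest_artifacts_dir(output_root: str = "output") -> Optional[Path]:
    """Return the artifacts folder of the newest run, or None if there is none.

    Assumes run folder names (the ${timestamp} part) sort chronologically.
    """
    root = Path(output_root)
    if not root.is_dir():
        return None
    runs = sorted(p for p in root.iterdir() if (p / "artifacts").is_dir())
    return runs[-1] / "artifacts" if runs else None


if __name__ == "__main__":
    print(latest_artifacts_dir() or "No finished runs found under ./output")
```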
