Microsoft GraphRAG in Practice: Quick Start

Latest recommended article published on 2025-03-30 20:24:26

rommel rain

Tags: microsoft, python, knowledge graph, artificial intelligence

Article link: https://blog.csdn.net/qq_52024723/article/details/142851541

quick-start

Follow the official example, Get Started. After installing the required dependencies and initializing the workspace, a few files need to be modified.

Edit ragtest/settings.yaml:

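Because I am not using OpenAI's LLM API, add api_base under both llm and embeddings.llm, set to the base_url of the endpoint you use.
Note that the LLM and the embedding model may be served from different base_urls; in that case each section also needs its own api_key.
Change chunks.size to 300.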

To keep costs down, I used the deepseek-chat API as the LLM and the text-embedding-3-small model served by agicto as the embedding model. The modified settings.yaml is:

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: true # recommended if this is available for your model.
  api_base: https://api.deepseek.com


parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: 'your-own-api-key' # key for the embedding endpoint
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://api.agicto.cn/v1
    
chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"

storage:
  type: file # or blob
  base_dir: "output"

reporting:
  type: file # or console, blob
  base_dir: "output"

entity_extraction:
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:

global_search:

The .env file in the same folder must also be edited: it supplies the value for ${GRAPHRAG_API_KEY} referenced in settings.yaml.
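
The relationship between the two files can be sketched as follows. This is only an illustration of `${NAME}` substitution, not GraphRAG's actual loader; the `parse_env` and `expand` helpers and the key value are hypothetical.

```python
import re


def parse_env(text):
    """Parse KEY=VALUE lines from .env-style text, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def expand(text, env):
    """Replace each ${NAME} with env[NAME]; unknown names are left untouched."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), m.group(0)), text)


# Hypothetical .env content; replace the value with your real key.
env = parse_env("GRAPHRAG_API_KEY=sk-example-not-a-real-key\n")
line = expand("  api_key: ${GRAPHRAG_API_KEY}", env)
print(line)
```

Keeping the key in .env rather than in settings.yaml means the configuration file can be shared without exposing the secret.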

Other notes

Some changes that may come in handy:

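On a first run, to save time and money, replace book.txt with a file that contains less text.
If an error occurs, check the log at ragtest/output/indexing-engine.log; the same log also traces what the program did, which helps in understanding the code.
If your corpus is in Chinese and you want the output in Chinese as well, edit the files in ragtest/prompts. These files are the prompts given to the LLM. There is no need to translate every prompt into Chinese; changing "in English" to "in Chinese" is enough. If a prompt does not state an output language, add an explicit instruction to answer in Chinese.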
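
The prompt edit in the last item can be scripted. The sketch below assumes the prompts are UTF-8 .txt files and that the phrase appears literally as "in English"; the `localize_prompts` helper is hypothetical, and the demo runs on a temporary directory with made-up prompt files rather than on ragtest/prompts.

```python
import tempfile
from pathlib import Path


def localize_prompts(prompt_dir, old="in English", new="in Chinese"):
    """Replace `old` with `new` in every .txt prompt; report which files changed."""
    changed, untouched = [], []
    for path in sorted(Path(prompt_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path.name)
        else:
            # No language mentioned: this prompt needs a manual instruction.
            untouched.append(path.name)
    return changed, untouched


# Demo on throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Write the summary in English.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Extract all entities.", encoding="utf-8")
    changed, untouched = localize_prompts(tmp)
    result = Path(tmp, "a.txt").read_text(encoding="utf-8")

print("changed:", changed)
print("needs a manual language instruction:", untouched)
```

To apply it to a real workspace, call `localize_prompts("ragtest/prompts")` after backing up the folder; the files reported as untouched are the ones where the output language has to be specified by hand.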