Background
GraphRAG has been generating a lot of buzz in the RAG space lately: it comes from a major vendor, and it is open source and free. Out of curiosity, I spent a weekend building a local test environment for running GraphRAG. The setup had quite a few pitfalls, so I am recording them here for anyone who needs them.
Local Environment
- Hardware: 2020 x86_64 MacBook Pro, 4 cores / 8 GB RAM, integrated graphics
- Software: Python 3.11, Ollama
- LLM: mistral:7b
Setup Steps
- >pip3 install graphrag # version 0.3.1
- >mkdir -p ./graphrag/input # create the input folder
- >curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./graphrag/input/book.txt # fetch the data file: Charles Dickens' A Christmas Carol
- >cd ./graphrag
- >python -m graphrag.index --init # initialize the workspace; equivalently: python -m graphrag.index --init --root ./graphrag
- Edit settings.yaml # see the configuration file reference at the end of this article
- Edit the .env file and add the setting below. (Optional; after I enabled it, indexing got stuck on claim extraction.) A quick Ollama sanity check follows this list.
GRAPHRAG_CLAIM_EXTRACTION_ENABLED=True
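Before building the index, it is worth confirming that the Ollama server is up and mistral:7b has been pulled. A minimal sanity check with the ollama Python client (my addition, not part of the original steps; it assumes the default localhost:11434 install):

import ollama

print(ollama.list())  # mistral:7b should appear among the pulled models
vec = ollama.embeddings(model="mistral:7b", prompt="ping")["embedding"]
print(len(vec))  # mistral:7b should return a 4096-dimensional vector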
Code Modifications
1. Edit /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/graphrag/llm/openai/openai_embeddings_llm.py
Comment out:
'''
embedding = await self.client.embeddings.create(
    input=input,
    **args,
)
return [d.embedding for d in embedding.data]
'''
Add:
import ollama

embedding_list = []
for inp in input:
    # embed each input text via the local Ollama server instead of the OpenAI API
    embedding = ollama.embeddings(model="mistral:7b", prompt=inp)
    embedding_list.append(embedding["embedding"])
return embedding_list
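For orientation, this snippet replaces the body of the async _execute_llm method of OpenAIEmbeddingsLLM. After the patch, the method looks roughly like this (a sketch from memory of the 0.3.x source with type annotations omitted, not a verbatim copy of the file):

async def _execute_llm(self, input, **kwargs):
    import ollama

    embedding_list = []
    for inp in input:
        # bypass self.client (the OpenAI SDK) and call the local Ollama server
        embedding = ollama.embeddings(model="mistral:7b", prompt=inp)
        embedding_list.append(embedding["embedding"])
    return embedding_list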
2. Edit /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/graphrag/query/llm/oai/embedding.py
Comment out:
# embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
# chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
# chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
# return chunk_embeddings.tolist()
Add:
import ollama

# inside the per-chunk loop: embed each chunk via the local Ollama server
embedding = ollama.embeddings(model="mistral:7b", prompt=chunk)["embedding"]
chunk_embeddings.append(embedding)
chunk_lens.append(len(chunk))
# after the loop, return the per-chunk vectors (the averaging above stays commented out)
return chunk_embeddings
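Note that this returns a list of per-chunk vectors, whereas the original code averaged them into a single flat vector. That shape change can bite later (see the pyarrow error in the Error Log below). A variant that keeps the original weighted-average step, relying on the numpy import (np) already present in this module, would be:

# inside the loop, as above
embedding = ollama.embeddings(model="mistral:7b", prompt=chunk)["embedding"]
chunk_embeddings.append(embedding)
chunk_lens.append(len(chunk))
# after the loop: keep the original averaging/normalization so the result stays one 1-D vector
chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
return chunk_embeddings.tolist()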
3. Edit /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/graphrag/query/llm
Add:
tokens = token_encoder.decode(tokens) # decode the tokens back into a string
4. Edit /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/graphrag/prompt_tune/prompt/entity_relationship.py
Change line 25 to:
Use {{record_delimiter}} as the list delimiter.
Note: Microsoft's latest code has already fixed this bug.
5. Edit /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/graphrag/query/structured_search/local_search/search.py
and /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/graphrag/query/structured_search/global_search/search.py
Change every search_messages definition from:
"""
search_messages = [
{"role": "system", "content": search_prompt},
{"role": "user", "content": query},
]
"""
to:
search_messages = [
    {"role": "user", "content": search_prompt + "\n\n ### USER QUESTION ### \n\n" + query}
]
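The motivation: some small models served through Ollama's OpenAI-compatible endpoint seem to follow instructions better when the system prompt and the question are merged into a single user message. The merged format can be tried outside GraphRAG with a few lines (a standalone check of my own, assuming the default endpoint, where any non-empty api_key is accepted):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="mistral:7b",
    messages=[{
        "role": "user",
        "content": "Answer briefly.\n\n ### USER QUESTION ### \n\n What is a knowledge graph?",
    }],
)
print(resp.choices[0].message.content)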
Build & Test
Test data: the primary-school text 《吃水不忘挖井人》 ("Never Forget the Well Diggers When Drinking Water"). Indexing the original A Christmas Carol took too long, so I switched to a shorter text.
>python -m graphrag.index # build the graph index; note the directory you run this from
(Screenshot of the indexing run omitted.)
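When the run finishes, the artifacts are written under output/<timestamp>/artifacts, per the storage setting in the configuration reference below. A quick way to peek at what was extracted (the parquet file name assumes graphrag 0.3.x's workflow naming; the timestamp directory here is hypothetical, substitute the one from your run):

import pandas as pd

# hypothetical timestamp directory; use the one from your own run
df = pd.read_parquet("output/20240801-120000/artifacts/create_final_entities.parquet")
print(df.head())  # sample of the extracted entities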
>python -m graphrag.query --method local "这篇文章的主题是什么?" # "What is the theme of this text?"
This errored out. My guess is it relates to model fit and the very short training corpus.
>python -m graphrag.query --method global "毛主席与水井有啥关系?" # "What is the connection between Chairman Mao and the well?"
This ran successfully, but the result was underwhelming :( My guess is that the corpus was too short and the model is primarily English-oriented, so not enough information was extracted. If you have the means, try a model with better Chinese support together with a better test corpus.
Error Log
Note: the errors below occurred even after the code changes were applied correctly and the model service was up.
1. FilePath: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/graspologic/partition/leiden.py
Error:
hierarchical_clusters_native = gn.hierarchical_leiden(
                               ^^^^^^^^^^^^^^^^^^^^^^^
leiden.EmptyNetworkError: EmptyNetworkError
Fix: switching from qwen2:0.5b to mistral:7b resolved it. (EmptyNetworkError most likely means the entity-extraction stage produced an empty graph, leaving Leiden nothing to cluster; a weak model that fails extraction triggers exactly this.)
Reference: Issue #562
2. FilePath: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/frame.py
Error:
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
Cause: the LLM does not seem to understand what the prompt asks. There can be various reasons, such as the LLM's max context window, or the service simply not working as expected.
Fix: lower the chunk size limit in settings.yaml from 1200 to 300 or 200 (see the chunks section in the configuration reference below).
Reference: Issue #362
3. Error:
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/openai/_base_client.py", line 1568, in _request
    raise APITimeoutError(request=request) from err
openai.APITimeoutError: Request timed out.
Fix: increase the timeout in settings.yaml (request_timeout); I set it to 1800 seconds.
4. Error during a local query:
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lance/dataset.py", line 2704, in _coerce_query_vector
query = pa.FloatingPointArray.from_pandas(query, type=pa.float32())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 1115, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: only handle 1-dimensional arrays
Fix: switching to a more capable model may resolve it.
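A plausible mechanism (my reading, not confirmed upstream): lance expects the query embedding as a flat 1-D float vector, while the patched _embed from Code Modifications step 2 can return a list of per-chunk vectors, i.e. a 2-D array. The error reproduces in isolation:

import numpy as np
import pyarrow as pa

pa.array(np.array([0.1, 0.2, 0.3], dtype=np.float32))    # OK: 1-D vector
pa.array(np.array([[0.1, 0.2, 0.3]], dtype=np.float32))  # ArrowInvalid: only handle 1-dimensional arrays

If that is the cause, the averaging variant shown under Code Modifications step 2 keeps the result 1-D and may avoid the error.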
Configuration File Reference
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  # api_key: ollama
  type: openai_chat # or azure_openai_chat
  # model: qwen2:0.5b
  model: mistral:7b
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  request_timeout: 1800.0
  api_base: http://localhost:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  # target: required # or all
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    # api_key: ollama
    type: openai_embedding # or azure_openai_embedding
    # model: qwen2:0.5b
    model: mistral:7b
    api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    request_timeout: 1800.0
    tokens_per_minute: 150_000 # set a leaky bucket throttle
    requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request

chunks:
  size: 200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
References
- Official documentation: Configuration Template; Prompt Tuning ⚙️
- Data file source: 使用查询引擎 | GraphRAG 中文文档教程
- Usage walkthroughs (CSDN): ollama轻松部署本地GraphRAG(避雷篇); 傻瓜操作:GraphRAG、Ollama 本地部署及踩坑记录; 【个人经验】GraphRAG+Ollama 本地部署 已跑通!; GraphRAG本地运行(Ollama的LLM接口+Xinference的embedding模型)无需gpt的api
- Embedding model notes (CSDN): 六、OpenAI之嵌入式(Embedding)
- Indexing internals: 深入Microsoft GraphRAG之索引阶段:原理、测试及如何集成到Neo4j图数据库 (火山引擎开发者社区)
- Prompt reference: Community Reports 提示词中文版 | GraphRAG 中文文档教程
Summary
In the spirit of keeping things simple, I did not use package managers such as conda or poetry, and I used the official GraphRAG package rather than a third-party fork.
No VPN or paid OpenAI API service was involved; everything ran locally on Ollama, which provided both inference and embeddings. I tried quite a few models along the way (qwen2:0.5b, qwen2:1.5b, gemma2:2b) without success; switching to mistral:7b finally got me out of the pit. The best engineering advice here is simply: keep experimenting.
Constrained by hardware, I did not use a dedicated third-party embedding model; the same LLM served embeddings as well, to save resources.
Because of my laptop's limited performance, the whole test was very time-consuming, and it took several late nights to get a successful run. If you get it working on the first try, congratulations on your natural talent :)