Microsoft GraphRAG: Reading the Index-Building Source Code

Zhong Yang

Last modified: 2025-04-22 17:02:10

Tags: microsoft graphrag

First published: 2025-04-22 16:31:57

Article link: https://blog.csdn.net/zhong_vic/article/details/147425021

1. Introduction

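GraphRAG is a tool for knowledge-graph-related tasks, and index building is the core of the system. It covers several key steps: loading data, processing it, building the graph, and generating communities. This article walks through the index-building source code to help readers understand how it works internally.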

2. Overview of the Index-Building Pipeline

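Index building in GraphRAG consists of the following core steps:

Configuration loading: read and parse the configuration file to obtain the parameters that indexing needs.
Data loading: load the input data from the data source named in the configuration.
Data processing: clean and transform the loaded data so that it is consistent and of good quality.
Graph construction: build a graph from the processed data to represent the relationships between entities.
Community generation: cluster the graph into communities to support later querying and analysis.
Report generation: generate a report for each community that summarizes its key information.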
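
The steps above run in order, each consuming the output of the one before it. A minimal sketch of that control flow follows. The stage functions and the `run_pipeline` helper are hypothetical stand-ins for illustration, not names from the GraphRAG codebase.

```python
from typing import Any, Callable

# A stage takes the accumulated pipeline state and returns the updated state.
Stage = Callable[[dict[str, Any]], dict[str, Any]]


def run_pipeline(stages: list[tuple[str, Stage]], state: dict[str, Any]) -> dict[str, Any]:
    """Run each named stage in order, threading the state through."""
    for name, stage in stages:
        state = stage(state)
        # Record which stages have finished, in execution order.
        state.setdefault("completed", []).append(name)
    return state


# Hypothetical stages: each one only adds a placeholder result to the state.
def load_data(state: dict[str, Any]) -> dict[str, Any]:
    return {**state, "documents": ["doc-1", "doc-2"]}


def build_graph(state: dict[str, Any]) -> dict[str, Any]:
    return {**state, "edges": [("doc-1", "doc-2")]}


result = run_pipeline([("load_data", load_data), ("build_graph", build_graph)], {})
```

Keeping each stage a plain function of the state makes it easy to rerun or replace a single step.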

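3. Source Code Analysis of Key Modules

3.1 Configuration Loading

Configuration loading covers parsing the configuration file and validating its parameters. The relevant code is shown below: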

# filePath：graphrag/config/load_config.py #startLine: 132 #endLine: 143
import json
from typing import Any

import yaml


def _parse(file_extension: str, contents: str) -> dict[str, Any]:
    """Parse configuration."""
    match file_extension:
        case ".yaml" | ".yml":
            return yaml.safe_load(contents)
        case ".json":
            return json.loads(contents)
        case _:
            # Any other extension is rejected explicitly.
            msg = (
                f"Unable to parse config. Unsupported file extension: {file_extension}"
            )
            raise ValueError(msg)

This code parses the configuration according to the file extension: .yaml and .yml files go through yaml.safe_load, .json files go through json.loads, and any other extension raises a ValueError.
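
To see the dispatch in use, here is a small sketch that reads a file and parses it by its suffix. `parse_config_text` restates the branching described above with `if` statements, and `load_config_file` is a hypothetical helper that is not part of GraphRAG. PyYAML is imported lazily, so the JSON path runs without it.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def parse_config_text(file_extension: str, contents: str) -> dict[str, Any]:
    """Dispatch on the file extension, mirroring the branches described above."""
    if file_extension in (".yaml", ".yml"):
        import yaml  # PyYAML; imported lazily so JSON-only use needs no extra package

        return yaml.safe_load(contents)
    if file_extension == ".json":
        return json.loads(contents)
    msg = f"Unable to parse config. Unsupported file extension: {file_extension}"
    raise ValueError(msg)


def load_config_file(path: Path) -> dict[str, Any]:
    """Hypothetical helper: read a file and parse it according to its suffix."""
    return parse_config_text(path.suffix, path.read_text(encoding="utf-8"))


# Usage: write a small JSON config to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "settings.json"
    config_path.write_text('{"input": {"type": "file"}}', encoding="utf-8")
    config = load_config_file(config_path)
```

Passing `Path.suffix` keeps the leading dot, which is the form the extension checks expect.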

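3.2 Data Loading

The data-loading step reads the input according to the input type set in the configuration (for example blob storage or file storage) and converts it into a pandas.DataFrame.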

# filePath：graphrag/index/input/factory.py
async def create_input(
    config: InputConfig,
    progress_reporter: ProgressLogger | None = None,
    root_dir: str | None = None,
) -> pd.DataFrame:
    # ...
    match config.type:
        case InputType.blob:
            # Read the input from blob storage
            pass
        case InputType.file:
            # Read the input from file storage
            pass
        case _:
            # Fall back to file storage by default
            pass
    # ...
    return result

The function picks a loading strategy based on the configured input type and returns a pandas.DataFrame.
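
The excerpt elides the loading logic itself. As a rough illustration of what a file-type loader does, the sketch below reads every text file under a directory into one record per file. It is an assumption-laden simplification: `load_text_files` is a hypothetical name, and it returns a list of dicts rather than a pandas.DataFrame so that it needs only the standard library.

```python
import tempfile
from pathlib import Path


def load_text_files(root_dir: str, pattern: str = "*.txt") -> list[dict[str, str]]:
    """Read every matching file under root_dir into a {"title", "text"} record."""
    records = []
    # Sort the paths so the output order is deterministic.
    for path in sorted(Path(root_dir).rglob(pattern)):
        records.append({"title": path.name, "text": path.read_text(encoding="utf-8")})
    return records


# Usage: create two small input files and load them.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("Alice met Bob.", encoding="utf-8")
    (Path(tmp) / "b.txt").write_text("Bob met Carol.", encoding="utf-8")
    documents = load_text_files(tmp)
```

Each record corresponds to one row of the DataFrame that the real loader would return.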

3.3 Data Processing

Data processing mainly normalizes the columns so that the required ones (id, text, title) are present.

# filePath：graphrag/index/input/util.py
def process_data_columns(
    documents: pd.DataFrame, config: InputConfig, path: str
) -> pd.DataFrame:
    if "id" not in documents.columns:
        documents["id"] = documents.apply(
            lambda x: gen_sha512_hash(x, x.keys()), axis=1
        )
    if config.text_column is not None and "text" not in documents.columns:
        # Handle the text column
        pass
    if config.title_column is not None:
        # Handle the title column
        pass
    else:
        documents["title"] = documents.apply(lambda _: path, axis=1)
    return documents

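This code checks whether the required columns exist and fills them in when they do not. For example, if there is no id column, it calls gen_sha512_hash on each row to derive an id from the row's contents, and if no title column is configured, it uses the file path as the title.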
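
The source of gen_sha512_hash is not shown in this article. The sketch below assumes it concatenates the selected field values and hashes the result with SHA-512; `row_hash` and `ensure_id_and_title` are hypothetical names, and plain dicts stand in for DataFrame rows. The title handling is also simplified: it fills in the path only when a row has no title.

```python
from hashlib import sha512
from typing import Any, Iterable


def row_hash(row: dict[str, Any], keys: Iterable[str]) -> str:
    """Concatenate the selected field values and return their SHA-512 hex digest."""
    joined = "".join(str(row[key]) for key in keys)
    return sha512(joined.encode("utf-8")).hexdigest()


def ensure_id_and_title(rows: list[dict[str, Any]], path: str) -> list[dict[str, Any]]:
    """Add an "id" (content hash) and a "title" (source path) where missing."""
    for row in rows:
        if "id" not in row:
            # Hash over the fields present before the id is added.
            row["id"] = row_hash(row, list(row.keys()))
        row.setdefault("title", path)
    return rows


rows = ensure_id_and_title(
    [{"text": "Alice met Bob."}, {"text": "Bob met Carol."}],
    "input/a.txt",
)
```

Because the id is a hash of the row's contents, two rows with identical contents receive the same id.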

3.4 Graph Construction

Graph construction creates a NetworkX graph from the processed data.

# filePath：graphrag/index/operations/create_graph.py
def create_graph(
    edges: pd.DataFrame,
    edge_attr: list[str | int] | None = None,
    nodes: pd.DataFrame | None = None,
    node_id: str = "title",
) -> nx.Graph:
    graph = nx.from_pandas_edgelist(edges, edge_attr=edge_attr)
    if nodes is not None:
        nodes.set_index(node_id, inplace=True)
        graph.add_nodes_from((n, dict(d)) for n, d in nodes.iterrows())
    return graph

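The function builds the graph from the edges DataFrame with nx.from_pandas_edgelist. When a nodes DataFrame is supplied, it also adds each row as a node keyed by node_id, with the remaining columns as node attributes.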
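
To make the resulting structure concrete without depending on pandas or NetworkX, the following sketch builds an equivalent undirected adjacency mapping from plain records. It assumes the edge records use "source" and "target" keys, which are the default column names of nx.from_pandas_edgelist. `build_adjacency` is a hypothetical helper, not GraphRAG code.

```python
from typing import Any, Optional


def build_adjacency(
    edges: list[dict[str, Any]],
    edge_attr: Optional[list[str]] = None,
    nodes: Optional[list[dict[str, Any]]] = None,
    node_id: str = "title",
) -> tuple[dict[str, dict[str, dict[str, Any]]], dict[str, dict[str, Any]]]:
    """Return (adjacency, node_data) for an undirected graph."""
    adjacency: dict[str, dict[str, dict[str, Any]]] = {}
    node_data: dict[str, dict[str, Any]] = {}
    for edge in edges:
        source, target = edge["source"], edge["target"]
        attrs = {key: edge[key] for key in (edge_attr or [])}
        # Undirected graph: store the edge in both directions.
        adjacency.setdefault(source, {})[target] = attrs
        adjacency.setdefault(target, {})[source] = attrs
        node_data.setdefault(source, {})
        node_data.setdefault(target, {})
    for node in nodes or []:
        name = node[node_id]
        # Nodes without edges still appear in the graph as isolated nodes.
        adjacency.setdefault(name, {})
        attrs = {key: value for key, value in node.items() if key != node_id}
        node_data.setdefault(name, {}).update(attrs)
    return adjacency, node_data


adjacency, node_data = build_adjacency(
    edges=[{"source": "Alice", "target": "Bob", "weight": 2.0}],
    edge_attr=["weight"],
    nodes=[{"title": "Alice", "type": "person"}, {"title": "Carol", "type": "person"}],
)
```

Here "Carol" has no edges, yet she is still in the graph because the node list is applied after the edge list.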

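3.5 Community Generation

Community generation clusters the graph, creates the communities, and aggregates entity and relationship information into them.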

# filePath：graphrag/index/workflows/create_communities.py
def create_communities(
    entities: pd.DataFrame,
    relationships: pd.DataFrame,
    max_cluster_size: int,
    use_lcc: bool,
    seed: int | None = None,
) -> pd.DataFrame:
    graph = create_graph(relationships)
    clusters = cluster_graph(
        graph,
        max_cluster_size,
        use_lcc,
        seed=seed,
    )
    communities = pd.DataFrame(
        clusters, columns=pd.Index(["level", "community", "parent", "title"])
    ).explode("title")
    # ...
    return final_communities

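The function first calls create_graph to build the graph, then clusters it with cluster_graph, and finally converts the clustering result into a pandas.DataFrame. The explode("title") call expands each cluster's list of entity titles into one row per entity.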
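
The effect of the explode("title") step can be shown on plain tuples. The sketch assumes each cluster has the shape (level, community id, parent id, list of entity titles), which is inferred from the column names in the excerpt. The sample data, including -1 as the parent of a top-level community, is made up for illustration.

```python
# Assumed cluster shape: (level, community id, parent id, [entity titles]).
clusters = [
    (0, 0, -1, ["Alice", "Bob"]),
    (0, 1, -1, ["Carol"]),
    (1, 2, 0, ["Alice"]),
]


def explode_clusters(clusters: list[tuple[int, int, int, list[str]]]) -> list[dict]:
    """Return one row per (community, entity title) pair."""
    return [
        {"level": level, "community": community, "parent": parent, "title": title}
        for level, community, parent, titles in clusters
        for title in titles
    ]


rows = explode_clusters(clusters)

# Collect every community that a given entity belongs to, across levels.
alice_communities = [row["community"] for row in rows if row["title"] == "Alice"]
```

After this step an entity can appear in several rows, one per hierarchy level at which it belongs to a community.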

3.6 Report Generation

Report generation produces a report for each community.

# filePath：graphrag/index/operations/summarize_communities/summarize_communities.py
async def run_generate(record):
    result = await _generate_report(
        strategy_exec,
        community_id=record[schemas.COMMUNITY_ID],
        community_level=record[schemas.COMMUNITY_LEVEL],
        community_context=record[schemas.CONTEXT_STRING],
        callbacks=callbacks,
        cache=cache,
        strategy=strategy_config,
    )
    tick()
    return result

run_generate awaits _generate_report to produce the report for one community record, then calls tick() to advance the progress indicator.
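
The pattern of an async worker that reports progress through a closure can be sketched as follows. `generate_report` is a hypothetical stand-in for the real report call, and `summarize_all` is an illustrative driver rather than GraphRAG's own.

```python
import asyncio


async def generate_report(record: dict) -> dict:
    """Hypothetical stand-in for the call that summarizes one community."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    summary = f"{len(record['context'])} chars of context"
    return {"community": record["community"], "summary": summary}


async def summarize_all(records: list[dict]) -> tuple[list[dict], int]:
    completed = 0

    def tick() -> None:
        # Progress callback: count one finished report.
        nonlocal completed
        completed += 1

    async def run_generate(record: dict) -> dict:
        result = await generate_report(record)
        tick()
        return result

    # gather returns results in the same order as the input records.
    reports = await asyncio.gather(*(run_generate(r) for r in records))
    return list(reports), completed


records = [
    {"community": 0, "context": "Alice, Bob"},
    {"community": 1, "context": "Carol"},
]
reports, completed = asyncio.run(summarize_all(records))
```

Calling tick() only after the await means the progress count reflects finished reports, not started ones.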

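4. Parallel Processing

Throughout indexing, GraphRAG can process rows concurrently to improve throughput. For example, derive_from_rows applies an asynchronous transform function to every row of a DataFrame.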

# filePath：graphrag/index/utils/derive_from_rows.py
async def derive_from_rows(
    input: pd.DataFrame,
    transform: Callable[[pd.Series], Awaitable[ItemType]],
    callbacks: WorkflowCallbacks | None = None,
    num_threads: int = 4,
    async_type: AsyncType = AsyncType.AsyncIO,
) -> list[ItemType | None]:
    callbacks = callbacks or NoopWorkflowCallbacks()
    match async_type:
        case AsyncType.AsyncIO:
            return await derive_from_rows_asyncio(
                input, transform, callbacks, num_threads
            )
        case AsyncType.Threaded:
            return await derive_from_rows_asyncio_threads(
                input, transform, callbacks, num_threads
            )
        case _:
            msg = f"Unsupported scheduling type {async_type}"
            raise ValueError(msg)

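The function chooses a concurrency strategy from the given async type (AsyncIO or Threaded) and applies the transform function to each row of the input. Any other value raises a ValueError.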
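
The bodies of derive_from_rows_asyncio and derive_from_rows_asyncio_threads are not shown here. One common way to cap concurrency at num_threads under asyncio is a semaphore. The sketch below illustrates that approach as an assumption, not as GraphRAG's actual implementation; `map_with_limit` and `double` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_with_limit(
    items: list[T],
    transform: Callable[[T], Awaitable[R]],
    limit: int = 4,
) -> list[R]:
    """Apply transform to every item, running at most `limit` calls at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        # Only `limit` coroutines can be inside this block at the same time.
        async with semaphore:
            return await transform(item)

    return list(await asyncio.gather(*(guarded(item) for item in items)))


# Track how many transforms run at the same time to confirm the cap holds.
stats = {"in_flight": 0, "peak": 0}


async def double(x: int) -> int:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate I/O latency
    stats["in_flight"] -= 1
    return x * 2


results = asyncio.run(map_with_limit(list(range(10)), double, limit=3))
```

The results keep the input order even though the calls overlap, because asyncio.gather returns values in the order the awaitables were passed.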

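5. Summary

Reading the GraphRAG index-building source shows that the process is made up of several key steps and modules: configuration loading, data loading, data processing, graph construction, community generation, and report generation. It also uses concurrent processing to improve efficiency. Together, these design choices let GraphRAG handle large-scale knowledge-graph data efficiently and provide solid support for later querying and analysis.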