Figure 1: LLM-generated knowledge graph built from a private dataset using GPT-4 Turbo.
Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.
图 1:使用 GPT-4 Turbo 构建的 LLM 生成的知识图。
GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when perform RAG-based tasks.
GraphRAG 是一种结构化的、分层的检索增强生成 (RAG) 方法,不同于使用纯文本片段的简单语义搜索方法。GraphRAG 的流程包括从原始文本中提取知识图谱、构建社区层次结构、为这些社区生成摘要,然后在执行基于 RAG 的任务时利用这些结构。
To learn more about GraphRAG and how it can be used to enhance your language model's ability to reason about your private data, please visit the Microsoft Research Blog Post.
要了解有关 GraphRAG 的更多信息以及如何使用它来增强语言模型推理您的私人数据的能力,请访问 Microsoft Research 博客文章 。
Solution Accelerator 🚀 解决方案加速器🚀
To quickstart the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.
为了快速启动 GraphRAG 系统,我们建议您尝试解决方案加速器包。它提供了用户友好的 Azure 资源端到端体验。
Get Started with GraphRAG 🚀
开始使用 GraphRAG🚀
To start using GraphRAG, check out the Get Started guide. For a deeper dive into the main sub-systems, please visit the docpages for the Indexer and Query packages.
要开始使用 GraphRAG,请查看入门指南。要深入了解主要子系统,请访问 Indexer 和 Query 包的文档页面。
GraphRAG vs Baseline RAG 🔍
GraphRAG 与 Baseline RAG 对比
Retrieval-Augmented Generation (RAG) is a technique to improve LLM outputs using real-world information. This technique is an important part of most LLM-based tools and the majority of RAG approaches use vector similarity as the search technique, which we call Baseline RAG. GraphRAG uses knowledge graphs to provide substantial improvements in question-and-answer performance when reasoning about complex information. RAG techniques have shown promise in helping LLMs to reason about private datasets - data that the LLM is not trained on and has never seen before, such as an enterprise’s proprietary research, business documents, or communications. Baseline RAG was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:
检索增强生成 (RAG) 是一种利用真实世界信息改进 LLM 输出的技术。这项技术是大多数基于 LLM 的工具的重要组成部分,大多数 RAG 方法使用向量相似度作为搜索技术,我们称之为 Baseline RAG。GraphRAG 使用知识图谱,在推理复杂信息时显著提高问答性能。RAG 技术在帮助 LLM 推理私有数据集方面表现出色——这些数据是 LLM 未经训练且从未见过的数据,例如企业的专有研究、商业文档或通信内容。Baseline RAG 的创建是为了帮助解决这个问题,但我们观察到 Baseline RAG 的表现非常糟糕。例如:
Baseline RAG struggles to connect the dots. This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.
基线 RAG 难以将各个点连接起来。当回答一个问题需要通过共享属性遍历不同的信息片段,以便提供新的综合见解时,就会发生这种情况。
Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.
当被要求全面理解大型数据集甚至单个大型文档中的总结语义概念时,基线 RAG 的表现不佳。
To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research’s new approach, GraphRAG, creates a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, are used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.
为了解决这个问题,科技界正在努力开发扩展和增强 RAG 的方法。微软研究院的新方法 GraphRAG 基于输入语料库创建知识图谱。该图谱与社区摘要和图机器学习输出一起,用于在查询时增强提示。GraphRAG 在回答上述两类问题方面表现出显著的进步,展现出其智能或精通程度,超越了此前应用于私有数据集的其他方法。
The GraphRAG Process 🤖
GraphRAG 流程
GraphRAG builds upon our prior research and tooling using graph machine learning. The basic steps of the GraphRAG process are as follows:
GraphRAG 建立在我们之前使用图机器学习的研究和工具之上。GraphRAG 流程的基本步骤如下:
Index 指数
Slice up an input corpus into a series of TextUnits, which act as analyzable units for the rest of the process, and provide fine-grained references in our outputs.
将输入语料库切分成一系列文本单元 (TextUnit),这些文本单元可作为其余过程的可分析单元,并在我们的输出中提供细粒度的参考。
Extract all entities, relationships, and key claims from the TextUnits.
从 TextUnits 中提取所有实体、关系和关键声明。
Perform a hierarchical clustering of the graph using the Leiden technique. To see this visually, check out Figure 1 above. Each circle is an entity (e.g., a person, place, or organization), with the size representing the degree of the entity, and the color representing its community.
使用莱顿技术对图表进行层次聚类。为了直观地查看,请查看上面的图 1。每个圆圈代表一个实体(例如,一个人、一个地点或一个组织),其大小代表实体的等级,颜色代表其所属的社群。
Generate summaries of each community and its constituents from the bottom-up. This aids in holistic understanding of the dataset.
自下而上地生成每个社区及其成员的摘要。这有助于全面理解数据集。
Query 询问
At query time, these structures are used to provide materials for the LLM context window when answering a question. The primary query modes are:
在查询时,这些结构用于为 LLM 上下文窗口提供回答问题所需的材料。主要的查询模式包括:
Global Search for reasoning about holistic questions about the corpus by leveraging the community summaries.
利用社区摘要对语料库的整体问题进行全局搜索推理。
Local Search for reasoning about specific entities by fanning-out to their neighbors and associated concepts.
局部搜索通过向邻居和相关概念展开来推理特定实体。
DRIFT Search for reasoning about specific entities by fanning-out to their neighbors and associated concepts, but with the added context of community information.
DRIFT 通过向邻居和相关概念展开来搜索有关特定实体的推理,但增加了社区信息的背景。
Prompt Tuning 快速调优
Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend to fine-tune your prompts following the Prompt Tuning Guide in our documentation.
使用 GraphRAG 处理您的开箱即用数据可能无法获得最佳结果。我们强烈建议您按照我们文档中的 “提示调整指南” 对您的提示进行微调。
Versioning 版本控制
Please see the breaking changes document for notes on our approach to versioning the project.
请参阅重大变更文档,了解有关我们对项目进行版本控制的方法的说明。
Always run graphrag init --root [path] --force between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so backup if necessary.
在小版本升级之间始终运行 graphrag init --root [path] --force ,以确保您拥有最新的配置格式。如果您想避免重新索引之前的数据集,请在大版本升级之间运行提供的迁移笔记本。请注意,这将覆盖您的配置和提示,因此请根据需要进行备份。
翻译原文: