[RAG Series] RAG Question Answering over Code Repositories: Building a Knowledge Graph for a Repository

技术仔QAQ

Last modified: 2024-10-23 19:33:18

Category: RAG Engineering. Tags: knowledge graph, artificial intelligence, python, learning methods, language models

First published: 2024-10-23 19:24:04

Article link: https://blog.csdn.net/m0_68116052/article/details/143188043

Preface

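Today's scenario is RAG question answering over a code repository. RAG works only if it can retrieve the documents relevant to a user query, which involves two stages: index construction and document retrieval. This article looks at both stages to explain why you would build a knowledge graph on a code repository, how to build one, and how to apply it in LLM and RAG engineering.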

1. Why build a knowledge graph on a code repository

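LLMs understand code well. Paste a specific code snippet into an LLM, add some prompt engineering, and it can usually handle a wide range of code tasks (editing, generation, question answering, and so on). Repository-level question answering is different: an entire repository is too large to fit into the LLM's context, so RAG is typically used to split and index the repository's code. This section covers only the indexing step of RAG.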

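Let's use an example to show how existing approaches split code. Consider a simple Python codebase in which "requirements.txt" defines the dependencies, "README.md" describes the project, and the "src/" directory contains the source code. The application converts times between time zones, and the codebase is structured as follows: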

TimeUtils/
├── src/
│ ├── time_app.py
│ └── time_utils.py
├── requirements.txt
└── README.md

The time_utils.py file contains:

from datetime import datetime
import pytz

def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = datetime.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."

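If we simply treat source code as text, we can use a text splitter (for example, CharacterTextSplitter in LangChain) to split it by character count. A split that looks only at character counts would divide the source code roughly as follows: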

############################ Chunk 1 ############################
from datetime import datetime
import pytz

def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime
############################ Chunk 2 ############################
                      .now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = date

############################ Chunk 3 ############################
                         time.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."

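Splitting this way clearly breaks functions apart, which makes RAG retrieval harder later on. If we ask "What does the convert_time_to_timezone function do?", the system will struggle to answer correctly, because the function is spread across two chunks that are unlikely to be retrieved together and recombined.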
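To make the problem concrete, here is a minimal sketch of purely size-based chunking in plain Python. It is not LangChain's implementation: `split_by_chars`, `chunks_containing`, and the chunk size of 200 are assumptions chosen for illustration. It shows that the definition of convert_time_to_timezone and its final return statement land in different chunks.

```python
# Naive fixed-size chunking: an illustrative stand-in for a text splitter
# that ignores syntax. This is NOT LangChain's CharacterTextSplitter.
SOURCE = '''from datetime import datetime
import pytz

def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = datetime.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."
'''


def split_by_chars(text: str, chunk_size: int) -> list[str]:
    """Cut the text every chunk_size characters, regardless of syntax."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunks_containing(chunks: list[str], needle: str) -> list[int]:
    """Return the indices of the chunks that contain the given substring."""
    return [i for i, chunk in enumerate(chunks) if needle in chunk]


chunks = split_by_chars(SOURCE, chunk_size=200)  # 200 is an arbitrary choice
def_chunks = chunks_containing(chunks, "def convert_time_to_timezone")
return_chunks = chunks_containing(chunks, 'return "Invalid timezone."')

# The function header and its last statement end up in different chunks.
print("definition in chunks:", def_chunks)
print("final return in chunks:", return_chunks)
```

A retriever that returns only the chunk matching the function name would therefore hand the LLM an incomplete function body.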

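There are also splitters designed specifically for source code, such as CodeSplitter in LlamaIndex. They use third-party tools to generate an abstract syntax tree (AST) from the source code. An AST is a tree data structure that represents a program. The following code is the example Wikipedia uses to illustrate an AST: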

while b ≠ 0:
    if a > b:
        a := a - b
    else:
        b := b - a
return a

An AST identifies the exact boundaries of every source code component (classes, functions, ...). With CodeSplitter, we get the following chunks:

############################ Chunk 1 ############################
from datetime import datetime
import pytz

############################ Chunk 2 ############################
def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

############################ Chunk 3 ############################
def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = datetime.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."

Now, if we ask "What does the convert_time_to_timezone function do?", the relevant chunk can be retrieved and the question answered correctly.
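The idea of AST-based splitting can be sketched with Python's standard `ast` module. This is an assumption-laden illustration, not how LlamaIndex's CodeSplitter is implemented: the helper `split_by_top_level_nodes` is hypothetical, it handles only Python, and it splits only at top-level functions and classes.

```python
import ast

SOURCE = '''from datetime import datetime
import pytz

def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = datetime.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."
'''

DEFINITIONS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def split_by_top_level_nodes(source: str) -> list[str]:
    """One chunk per top-level function or class; consecutive other
    statements (imports, assignments) are grouped into a single chunk."""
    lines = source.splitlines()
    chunks: list[str] = []
    pending: list[str] = []  # module-level statements not yet emitted
    for node in ast.parse(source).body:
        # lineno/end_lineno give the exact boundaries of each component.
        segment = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if isinstance(node, DEFINITIONS):
            if pending:
                chunks.append("\n".join(pending))
                pending = []
            chunks.append(segment)
        else:
            pending.append(segment)
    if pending:
        chunks.append("\n".join(pending))
    return chunks


chunks = split_by_top_level_nodes(SOURCE)
for index, chunk in enumerate(chunks, start=1):
    print(f"##### Chunk {index} #####")
    print(chunk)
```

Because the boundaries come from the parser rather than from a character count, each function stays in one piece.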

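(Note: the above is an illustrative example, with the AST sample taken from Wikipedia. In actual use, I found the results to be about the same as with LangChain's CharacterTextSplitter; try it yourself if you are interested.)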

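However, even when source code is split into structurally meaningful chunks, information about the relationships within and between chunks is still lost. Source code is fundamentally different from natural language: it has a specific structure and it is executable, and the files in a codebase have an inherent structure and depend on one another. A RAG system that handles source code in this fragmented way will struggle with queries such as "How many functions are defined in the time_utils.py file?" or "List all files where the variable local_zone is used". The answers to these questions do not necessarily sit in a single chunk, function, class, or file. Answering them well requires the ability to reason over the entire codebase. This is exactly where building a knowledge graph on the codebase becomes invaluable.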

2. How to build a knowledge graph on a code repository

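Building graphs on codebases is not a new idea in software engineering. A large body of work applies static and dynamic analysis to source code to build graphs, which are then used for machine code optimization or vulnerability detection.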

A simple knowledge graph

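Here we show how to build a simple knowledge graph on a codebase so that an LLM can reason over the whole codebase. In the example graph, blue nodes represent files and directories, and green nodes represent AST nodes. There are three kinds of edges: HAS_FILE edges from a parent directory to each child file, HAS_AST edges from a source file to its root AST node, and PARENT_OF edges from a parent AST node to each child AST node.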
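The schema described above could be sketched as follows, assuming Python's standard `ast` module as the parser and plain dictionaries and sets as the graph store. The functions `build_graph` and `add_ast_nodes`, the node id format, and the contents of `files` are hypothetical. Note that `ast` names its node types "FunctionDef" and "Name", whereas the Cypher queries in this article use names such as "function_definition" and "identifier".

```python
import ast
import itertools
from pathlib import PurePosixPath

SOURCE = '''from datetime import datetime
import pytz

def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = datetime.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."
'''


def add_ast_nodes(node, prefix, counter, nodes, edges):
    """Add one graph node per AST node, plus PARENT_OF edges to its children."""
    node_id = f"{prefix}#{next(counter)}"  # hypothetical id scheme
    nodes[node_id] = {
        "label": "ast",
        "type": type(node).__name__,  # e.g. "FunctionDef", "Name"
        "text": getattr(node, "name", None) or getattr(node, "id", None),
    }
    for child in ast.iter_child_nodes(node):
        child_id = add_ast_nodes(child, prefix, counter, nodes, edges)
        edges.add((node_id, "PARENT_OF", child_id))
    return node_id


def build_graph(files: dict[str, str]):
    """Build (nodes, edges) from a mapping of file path -> file content."""
    nodes: dict[str, dict] = {}
    edges: set[tuple[str, str, str]] = set()
    for path, content in files.items():
        parts = PurePosixPath(path).parts
        # One node per directory or file; HAS_FILE links parent to child.
        for depth in range(1, len(parts) + 1):
            node_id = "/".join(parts[:depth])
            nodes[node_id] = {"label": "file", "filename": parts[depth - 1]}
            if depth > 1:
                edges.add(("/".join(parts[:depth - 1]), "HAS_FILE", node_id))
        # Only Python files get an AST in this sketch.
        if path.endswith(".py"):
            root_id = add_ast_nodes(
                ast.parse(content), path, itertools.count(), nodes, edges
            )
            edges.add((path, "HAS_AST", root_id))
    return nodes, edges


# Illustrative input: only time_utils.py is taken from the article.
files = {
    "TimeUtils/src/time_utils.py": SOURCE,
    "TimeUtils/requirements.txt": "pytz\n",
}
nodes, edges = build_graph(files)
print(len(nodes), "nodes,", len(edges), "edges")
```

The resulting node and edge collections map directly onto a property graph, so loading them into a graph database is a matter of iterating over both.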

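The graph is then stored in a graph database such as Neo4j. The earlier questions, "How many functions are defined in the time_utils.py file?" and "List all files where the variable local_zone is used", can then be answered by an LLM that produces output like the following: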

Generated Cypher:
MATCH (f:file {filename: 'time_utils.py'})-[:HAS_AST]->(root:ast)
MATCH (root)-[:PARENT_OF*]->(func:ast {type: 'function_definition'})
RETURN count(func) AS NumberOfFunctions
Full Context:
[{'NumberOfFunctions': 2}]

{'input': 'How many functions are defined in the time_utils.py file?',
 'output': 'There are 2 functions defined in the time_utils.py file.'}

Generated Cypher:
MATCH (f:file)-[:HAS_AST]->(root:ast)
MATCH (root)-[:PARENT_OF*]->(func:ast {type: 'identifier', text: 'local_zone'})
RETURN f
Full Context:
[{'f': { 'filename': 'time_utils.py' }}]

{'input': 'List all files where the variable local_zone is used.',
 'output': 'The variable local_zone is used in the time_utils.py file'}

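In this way, the LLM generates a Cypher query from the user's question and uses the result to respond. The queries can be generated with LangChain's GraphCypherQAChain. Furthermore, if you build a RAG agent on top of the knowledge graph, you can prompt it to query the graph database several times and chain multiple Cypher queries together to solve more complex problems.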
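To show what the two Cypher queries compute, here is a plain-Python sketch that runs the same traversals over a tiny hand-written graph. It does not use Neo4j or LangChain. The node ids, the contents of `NODES` and `EDGES` (including the made-up AST for time_app.py), and the helper names are all assumptions for illustration.

```python
# Toy graph in the file / ast schema described above. All ids are made up,
# and the time_app.py subtree is invented purely for the example.
NODES = {
    "f1": {"label": "file", "filename": "time_utils.py"},
    "f2": {"label": "file", "filename": "time_app.py"},
    "a1": {"label": "ast", "type": "module"},
    "a2": {"label": "ast", "type": "function_definition"},
    "a3": {"label": "ast", "type": "function_definition"},
    "a4": {"label": "ast", "type": "identifier", "text": "local_zone"},
    "b1": {"label": "ast", "type": "module"},
    "b2": {"label": "ast", "type": "identifier", "text": "main"},
}
EDGES = [
    ("f1", "HAS_AST", "a1"),
    ("a1", "PARENT_OF", "a2"),
    ("a1", "PARENT_OF", "a3"),
    ("a3", "PARENT_OF", "a4"),
    ("f2", "HAS_AST", "b1"),
    ("b1", "PARENT_OF", "b2"),
]


def children(node_id: str, edge_type: str) -> list[str]:
    """Targets of all edges of the given type leaving node_id."""
    return [dst for src, kind, dst in EDGES if src == node_id and kind == edge_type]


def descendants(node_id: str) -> list[str]:
    """Nodes reachable via one or more PARENT_OF edges (like [:PARENT_OF*])."""
    stack = children(node_id, "PARENT_OF")
    found = []
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children(current, "PARENT_OF"))
    return found


def count_functions(filename: str) -> int:
    """Equivalent of the first query: count function definitions in a file."""
    total = 0
    for file_id, props in NODES.items():
        if props["label"] == "file" and props["filename"] == filename:
            for root in children(file_id, "HAS_AST"):
                total += sum(
                    NODES[n]["type"] == "function_definition"
                    for n in descendants(root)
                )
    return total


def files_using(identifier: str) -> list[str]:
    """Equivalent of the second query: files whose AST contains the identifier."""
    result = []
    for file_id, props in NODES.items():
        if props["label"] != "file":
            continue
        for root in children(file_id, "HAS_AST"):
            if any(
                NODES[n]["type"] == "identifier" and NODES[n].get("text") == identifier
                for n in descendants(root)
            ):
                result.append(props["filename"])
                break
    return result


print(count_functions("time_utils.py"))
print(files_using("local_zone"))
```

In a real setup the graph database performs these traversals; the sketch only makes explicit that both answers come from walking edges rather than from retrieving a single text chunk.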

A more advanced knowledge graph

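The knowledge graph can be extended with other graph types used in static analysis, such as data flow graphs or control flow graphs. More interestingly, the graph can also incorporate runtime data (that is, data from dynamic analysis), such as test coverage. Merging in this dynamic information gives the LLM more potential to solve harder problems.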
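As one possible illustration of merging dynamic data, the sketch below turns a coverage report into edges between tests and the functions they execute. Everything here is hypothetical: the toy `SOURCE`, the `COVERAGE` mapping (test name to executed line numbers), the `COVERS` edge type, and the `coverage_edges` helper are assumptions, not part of the graph schema described in this article.

```python
import ast

# Toy module used only for this example.
SOURCE = '''def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''

# Hypothetical coverage report: test name -> line numbers executed in SOURCE.
COVERAGE = {
    "test_add": {2},
    "test_sub": {5},
    "test_both": {2, 5},
}


def coverage_edges(source: str, coverage: dict[str, set[int]]):
    """Emit (test, "COVERS", function) for each test that executes at
    least one line inside the function's line range."""
    edges = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            span = set(range(node.lineno, node.end_lineno + 1))
            for test_name, executed in sorted(coverage.items()):
                if span & executed:
                    edges.append((test_name, "COVERS", node.name))
    return edges


edges = coverage_edges(SOURCE, COVERAGE)
for edge in edges:
    print(edge)
```

With edges like these in the graph, a question such as "which tests exercise this function?" becomes a one-hop query.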

Summary

This article explained why to build a knowledge graph on a code repository and how to build one.

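It covered RAG index construction for code, building the knowledge graph, and scenarios for using it. Whether as part of a RAG pipeline or as a tool for an agent, knowledge graphs are worth exploring in depth.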

For more on code repositories, search Google Scholar for "repository level coding" or "repository level code llm".
