AI Large Model Source Code Walkthrough | Build Your Own GitHub Q&A Assistant!

Article link: https://blog.csdn.net/double_sweet1/article/details/142844804

Hi! I'm Xiaogu. It's been a while!

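While browsing the web a while ago, I came across a typical application of large models in a vertical domain: a GitHub Q&A assistant. Its features include recommending high-quality repositories, explaining code snippets, filing issues, searching issues, and more. For a front-end component library like Ant Design, it can even take a prototype image and tell you which Ant Design atomic components appear in the interface!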

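I tried it myself and found it genuinely convenient. The source code is also open on GitHub. The implementation covers prompt tuning, building a vectorized knowledge base, and LangChain toolchain integration, and there is a lot to learn at each stage, so let's walk through the source code and look at the implementation details of the large-model part!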

Introduction to PeterCat

As the official website puts it, PeterCat is an intelligent Q&A bot solution built for community maintainers and developers.

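It lets users quickly set up an intelligent Q&A bot for a GitHub repository through a conversational interface on the platform. Built-in capabilities include filing issues, searching issues, replying to issues, replying to Discussions, PR Summary, Code Review, and project information lookup. It can also be integrated into a project repository through a self-hosted deployment and an all-in-one application SDK.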

The PeterCat platform currently hosts 9 Q&A bots for typical front-end projects.

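Taking Ant Design as an example, we can use the Ant Design Q&A assistant not only to learn how to get started with the Ant Design component library, but also to determine from a single prototype image which Ant Design components could implement it. It even recognizes charts and text accurately. The model's ability to parse image content really exceeded my expectations!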

[Images: prototype mockup and its recognition result]

Next, let's see how PeterCat does it!

Source Code Walkthrough

Common Industry Approach

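A vertical application scenario for large models means applying a large model within a specific domain. It mainly addresses the problem that general-purpose models such as GPT and Tongyi Qianwen perform poorly in a specific domain because they lack domain knowledge. Such scenarios usually require feeding the model a large amount of data as a knowledge base to support its decisions.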

For vertical applications of large models, a common SOP is:

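Build a vectorized knowledge base -> (fine-tune the model) -> user prompt input -> vector-based keyword retrieval -> re-rank query results -> generate the model prompt -> model intent recognition -> generate the result.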

If you host your own model, you can also fine-tune it to improve intent recognition quality.

Notice that vectorization comes up repeatedly, in both the knowledge base construction stage and the keyword retrieval stage.

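Vectorization means converting large amounts of data or text into vector representations (embeddings). Once vectorized, related items lie close together in the vector space, so the relationships and similarity between them can be measured numerically, which improves model training and prediction.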
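To make "similarity between vectors" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-made for illustration only; a real embedding model produces vectors with hundreds or thousands of dimensions, and the labels are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (assumption: not produced by any real model).
vectors = {
    "button component": [0.9, 0.1, 0.0],
    "click handler": [0.8, 0.3, 0.1],
    "database index": [0.0, 0.2, 0.9],
}

query = [1.0, 0.2, 0.0]

# Rank the stored items from most to least similar to the query vector.
ranked = sorted(
    vectors,
    key=lambda name: cosine_similarity(query, vectors[name]),
    reverse=True,
)
print(ranked)
```

Items whose vectors point in nearly the same direction as the query rank first, which is the property a vector store relies on when it retrieves "similar" documents.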

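The process from vector-based keyword retrieval to model prompt generation is what we usually call RAG (Retrieval-Augmented Generation). Its goal is to raise the quality of the prompt that is finally fed to the model, in other words, to ask the model better questions.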
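The retrieve-then-prompt idea can be sketched in a few lines. This is a toy illustration, not PeterCat's implementation: the word-count "embedding", the sample knowledge chunks, and the prompt wording are all assumptions standing in for a real embedding model and vector store.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """Augment the user's question with the retrieved context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


knowledge = [
    "The Button component supports primary and dashed types",
    "Issues should include a minimal reproduction",
    "The Table component supports pagination and sorting",
]

prompt = build_prompt("which types does the button component support", knowledge, k=1)
print(prompt)
```

Only the most relevant chunk reaches the model, which keeps the prompt short and grounded in the knowledge base.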

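On top of this SOP, using LangChain to integrate embedding, prompt generation, and chained tool calls roughly completes the end-to-end invocation of a large model in a vertical domain.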

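If the large model is the human brain, LangChain is like the limbs and torso: the model only needs to focus on its core job of prediction, while tool calls, context memory, multi-turn dialogue, and similar work are coordinated by LangChain.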

PeterCat Source Code

With that background, let's look at the implementation details of PeterCat's large-model modules:

PeterCat's large-model code lives mainly in the server/agent directory, which covers:

four capabilities: bot construction, LLM invocation, prompt design, and preset tools,
with base.py and qa_chat.py acting as the hub where LangChain handles user input and model output.

Building the vectorized knowledge base and RAG retrieval and generation are implemented in the petercat_utils directory and are invoked through server/routers/rag.py.

Agent Workflow

According to PeterCat's official documentation, when a user enters a GitHub repository URL or name on the PeterCat platform, the Agent workflow for creating the Q&A bot is:

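Use the creation tool to generate the bot's configuration for that repository (prompt, name, avatar, opening message, guidance text, tool set, ...) and at the same time trigger the ingestion tasks for issues and Markdown files.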

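These tasks are split into multiple subtasks. All resolved issues, highly voted replies, and the content of every Markdown file in the repository go through a load -> split -> embed -> store pipeline to build the knowledge base that the bot's replies draw on.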
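The "split" step is the least obvious of the four, so here is a simplified sketch of chunking with overlap. It uses a fixed-size sliding window, which is an assumption made for brevity: LangChain's CharacterTextSplitter splits on a separator and then merges pieces up to the chunk size, so its output will differ. The function name `split_text` is hypothetical.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk would then be embedded and stored together with metadata such as the repository name and file path.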

RAG

The PeterCat server is built with the FastAPI framework and uses Supabase for data storage.

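To make matching more efficient during RAG retrieval, all documents in the GitHub repository associated with a bot are embedded and then stored in the database by category.
The user's query is embedded as well and matched against the knowledge stored in Supabase, and the matching results are returned.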

This process includes the load -> split -> embed -> store steps mentioned above.

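from typing import Any

from langchain_community.vectorstores import SupabaseVectorStore
from langchain_openai import OpenAIEmbeddings

# get_client, TABLE_NAME, QUERY_NAME, CHUNK_SIZE and CHUNK_OVERLAP are
# defined elsewhere in the project.


def supabase_embedding(documents, **kwargs: Any):
    from langchain_text_splitters import CharacterTextSplitter

    try:
        # split: cut the loaded documents into overlapping chunks
        text_splitter = CharacterTextSplitter(
            chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
        )
        docs = text_splitter.split_documents(documents)
        # embed + store: embed each chunk and write it to the Supabase table
        embeddings = OpenAIEmbeddings()
        vector_store = SupabaseVectorStore.from_documents(
            docs,
            embeddings,
            client=get_client(),
            table_name=TABLE_NAME,
            query_name=QUERY_NAME,
            chunk_size=CHUNK_SIZE,
            **kwargs,
        )
        return vector_store
    except Exception as e:
        # Errors are printed and swallowed; callers receive None on failure.
        print(e)
        return None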

def add_knowledge_by_doc(config: RAGGitDocConfig):
    loader = init_github_file_loader(config)
    documents = loader.load()
    supabase = get_client()
    is_doc_added_query = (
        supabase.table(TABLE_NAME)
        .select("id, repo_name, commit_id, file_path, bot_id")
        .eq("repo_name", config.repo_name)
        .eq("commit_id", loader.commit_id)
        .eq("file_path", config.file_path)
        .eq("bot_id", config.bot_id)
        .execute()
    )
    if not is_doc_added_query.data:
        is_doc_equal_query = (
            supabase.table(TABLE_NAME).select("*").eq("file_sha", loader.file_sha)
        ).execute()
        if not is_doc_equal_query.data:
            # If there is no file with the same file_sha, perform embedding.
            store = supabase_embedding(
                documents,
                repo_name=config.repo_name,
                commit_id=loader.commit_id,
                file_sha=loader.file_sha,
                file_path=config.file_path,
                bot_id=config.bot_id,
            )
            return store
        else:
            new_commit_list = [
                {
                    **{k: v for k, v in item.items() if k != "id"},
                    "repo_name": config.repo_name,
                    "commit_id": loader.commit_id,
                    "file_path": config.file_path,
                    "bot_id": config.bot_id,
                }
                for item in is_doc_equal_query.data
            ]
            insert_result = supabase.table(TABLE_NAME).insert(new_commit_list).execute()
            return insert_result
    else:
        return True

def search_knowledge(
    query: str,
    bot_id: str,
    meta_filter: Dict[str, Any] = {},
):
    retriever = init_retriever({"filter": {"metadata": meta_filter, "bot_id": bot_id}})
    docs = retriever.invoke(query)
    documents_as_dicts = [convert_document_to_dict(doc) for doc in docs]
    json_output = json.dumps(documents_as_dicts, ensure_ascii=False)
    return json_output
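The branching in add_knowledge_by_doc is easy to miss: it avoids re-embedding a file whose content is already stored. The sketch below restates that three-way decision over an in-memory list of rows instead of Supabase queries. The function name `plan_ingestion` and the returned labels are hypothetical and exist only for illustration.

```python
def plan_ingestion(rows, repo_name, commit_id, file_path, bot_id, file_sha):
    """Decide what to do with a file, mirroring add_knowledge_by_doc.

    'skip'  : this exact file at this commit is already stored for the bot
    'copy'  : identical content (same file_sha) exists, so reuse its rows
    'embed' : nothing reusable exists, so embeddings must be computed
    """
    key = (repo_name, commit_id, file_path, bot_id)
    already_added = any(
        (r["repo_name"], r["commit_id"], r["file_path"], r["bot_id"]) == key
        for r in rows
    )
    if already_added:
        return "skip"
    if any(r["file_sha"] == file_sha for r in rows):
        return "copy"
    return "embed"


stored = [
    {
        "repo_name": "ant-design/ant-design",
        "commit_id": "c1",
        "file_path": "README.md",
        "bot_id": "bot-a",
        "file_sha": "sha-1",
    }
]

# Same file, same commit, same bot: nothing to do.
print(plan_ingestion(stored, "ant-design/ant-design", "c1", "README.md", "bot-a", "sha-1"))
# New commit but unchanged content: copy the existing rows.
print(plan_ingestion(stored, "ant-design/ant-design", "c2", "README.md", "bot-a", "sha-1"))
# Changed content: compute new embeddings.
print(plan_ingestion(stored, "ant-design/ant-design", "c2", "README.md", "bot-a", "sha-2"))
```

Reusing rows with a matching file_sha saves an embedding API call whenever a file is unchanged between commits or shared between bots.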

LLM Invocation

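PeterCat has built-in support for two base models, OpenAI and Gemini. As the core of the Q&A assistant, the LLM client lets you adjust the randomness of the model's answers through the temperature parameter.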

@register_llm_client("openai")
class OpenAIClient(BaseLLMClient):
    _client: ChatOpenAI

    def __init__(
        self,
        temperature: Optional[float] = 0.2,
        max_tokens: Optional[int] = 1500,
        streaming: Optional[bool] = False,
        api_key: Optional[str] = OPEN_API_KEY,
    ):
        self._client = ChatOpenAI(
            model_name="gpt-4o",
            temperature=temperature,
            streaming=streaming,
            max_tokens=max_tokens,
            openai_api_key=api_key,
            stream_usage=True,
        )

    def get_client(self):
        return self._client

    def get_tools(self, tools: List[Any]):
        return [convert_to_openai_tool(tool) for tool in tools]

    def parse_content(self, content: List[MessageContent]):
        return content
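The excerpt shows `@register_llm_client("openai")` being used but not how it is defined. A common way to implement such a decorator is a name-to-class registry, sketched below. This is an assumption about the pattern, not PeterCat's actual code: the registry dict, the stub client classes, and `get_llm_client` are all hypothetical.

```python
from typing import Callable, Dict, Type


class BaseLLMClient:
    """Minimal stand-in for the base class; real clients wrap a chat model."""

    def name(self) -> str:
        raise NotImplementedError


# Hypothetical registry mapping a provider key to its client class.
_LLM_CLIENTS: Dict[str, Type[BaseLLMClient]] = {}


def register_llm_client(key: str) -> Callable[[Type[BaseLLMClient]], Type[BaseLLMClient]]:
    """Class decorator that records the class under the given key."""

    def decorator(cls: Type[BaseLLMClient]) -> Type[BaseLLMClient]:
        _LLM_CLIENTS[key] = cls
        return cls

    return decorator


@register_llm_client("openai")
class StubOpenAIClient(BaseLLMClient):
    def name(self) -> str:
        return "openai stub"


@register_llm_client("gemini")
class StubGeminiClient(BaseLLMClient):
    def name(self) -> str:
        return "gemini stub"


def get_llm_client(key: str) -> BaseLLMClient:
    """Look up a registered client class by key and instantiate it."""
    if key not in _LLM_CLIENTS:
        raise ValueError(f"Unknown LLM client: {key!r}")
    return _LLM_CLIENTS[key]()


print(get_llm_client("gemini").name())  # gemini stub
```

With a registry like this, adding a new base model only requires defining a class and decorating it; the code that selects a model by name does not change.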

Prompt Design

The prompt is a key factor in the quality of the model's replies, and I have to say that PeterCat's preset prompts are written very professionally!

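Here is the prompt for creating/editing a Q&A bot. It not only states clearly what is being asked of the model, but also tells LangChain which tools to use for the engineering side of the work (the create_bot tool and the edit_bot tool).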

from typing import Optional

CREATE_PROMPT = """
## Role:
You are a GitHub Answering Bot Creation Assistant. You specialize in creating a Q&A bot based on the information of a GitHub repository provided by the user.

## Skills:

Skill 1: Retrieve GitHub Repository Name

- Guide users to provide their GitHub repository name or URL.
- Extract the GitHub repository name from the provided GitHub URL

Skill 2: Create a Q&A Bot

- Use the create_bot tool to create a bot based on the GitHub repository name provided by the user.
- The uid of the current user is {user_id}

Skill 3: Modify Bot Configuration

- Utilize the edit_bot tool to modify the bot's configuration information based on the user's description.
- Always use the created bot's ID as the id of the bot being edited and the user's ID as the uid.
- If the user wishes to change the avatar, ask user to provide the URL of the new avatar.

## Limitations:

- Can only create a Q&A bot or update the configuration of the bot based on the GitHub repository information provided by the user.
- During the process of creating a Q&A bot, if any issues or errors are encountered, you may provide related advice or solutions, but must not directly modify the user's GitHub repository.
- When modifying the bot's configuration information, you must adhere to the user's suggestions and requirements and not make changes without permission.
- Whenever you encounter a 401 or Unauthorized error that seems to be an authentication failure, please inform the user in the language they are using to converse with you. For example:

If user is conversing with you in Chinese:
“您必须先使用 GitHub 登录 Petercat 才能使用此功能。[登录地址](https://api.petercat.ai/api/auth/login)

If user is conversing with you in English:
“You must log in to Petercat using GitHub before accessing this feature.” [Login URL](https://api.petercat.ai/api/auth/login)
"""

EDIT_PROMPT = """
## Role:
You are a GitHub Answering Bot modifying assistant. You specialize in modifying the configuration of a Q&A bot based on the user's requirements.

## Skills:

- Utilize the edit_bot tool to modify the bot's configuration information based on the user's description.
- Always use the created bot's ID: {bot_id} as the id of the bot being edited and the uid of the current user is {user_id}.
- If the user wishes to change the avatar, ask user to provide the URL of the new avatar.

## Limitations:

- Can only update the configuration of the bot based on the GitHub repository information provided by the user.
- During the process of modifying a Q&A bot, if any issues or errors are encountered, you may provide related advice or solutions, but must not directly modify the user's GitHub repository.
- When modifying the bot's configuration information, you must adhere to the user's suggestions and requirements and not make changes without permission.

If user is conversing with you in Chinese:
“您必须先使用 GitHub 登录 Petercat 才能使用此功能。[登录地址](https://api.petercat.ai/api/auth/login)

If user is conversing with you in English:
“You must log in to Petercat using GitHub before accessing this feature.” [Login URL](https://api.petercat.ai/api/auth/login)
"""

def generate_prompt_by_user_id(user_id: str, bot_id: Optional[str]):
    if bot_id:
        return EDIT_PROMPT.format(bot_id=bot_id, user_id=user_id)
    else:
        return CREATE_PROMPT.format(user_id=user_id)

Preset Tools

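As mentioned earlier, Q&A bots built on the PeterCat platform come with built-in capabilities: filing issues, searching issues, replying to issues, replying to Discussions, PR Summary, Code Review, and project information lookup.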

These jobs are essentially engineering tasks and are exposed to LangChain as tools to call.

# Method of AgentBuilder, shown on its own here.
def _create_agent_with_tools(self) -> AgentExecutor:
    llm = self.chat_model.get_client()

    # Start with the Tavily search tool if enabled, then add the preset tools.
    tools = self.init_tavily_tools() if self.enable_tavily else []

    for tool in self.tools.values():
        tools.append(tool)

    if tools:
        # Convert the tools to the model's tool schema and bind them to the LLM.
        parsed_tools = self.chat_model.get_tools(tools)
        llm = llm.bind_tools(parsed_tools)

    self.prompt = self.get_prompt()
    # Chain: map inputs -> prompt -> LLM -> parse tool calls or final answer.
    agent = (
        {
            "input": lambda x: x["input"],
            "agent_scratchpad": lambda x: format_to_openai_tool_messages(
                x["intermediate_steps"]
            ),
            "chat_history": lambda x: x["chat_history"],
        }
        | self.prompt
        | llm
        | OpenAIToolsAgentOutputParser()
    )

    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=True,
        handle_parsing_errors=True,
        max_iterations=5,
    )
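Conceptually, an agent executor runs a loop: ask the model what to do, run the tool it picked, feed the observation back, and stop when the model gives a final answer or the iteration cap is hit. The sketch below shows that loop with a scripted stand-in for the model. It is a simplified illustration and not LangChain's implementation; `run_agent`, `fake_model`, and the `search_issues` tool are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple, Union

# A decision is either a final answer (str) or a tool call (name, argument).
Decision = Union[str, Tuple[str, str]]
Step = Tuple[str, str, str]  # (tool name, argument, observation)


def run_agent(
    decide: Callable[[str, List[Step]], Decision],
    tools: Dict[str, Callable[[str], str]],
    user_input: str,
    max_iterations: int = 5,
) -> str:
    """Loop until the model answers or max_iterations is reached."""
    steps: List[Step] = []
    for _ in range(max_iterations):
        decision = decide(user_input, steps)
        if isinstance(decision, str):
            return decision  # final answer, stop looping
        name, argument = decision
        observation = tools[name](argument)
        # Intermediate steps play the role of the agent scratchpad.
        steps.append((name, argument, observation))
    return "Agent stopped: iteration limit reached."


def fake_model(user_input: str, steps: List[Step]) -> Decision:
    """Scripted stand-in for an LLM: call one tool, then answer."""
    if not steps:
        return ("search_issues", user_input)
    return f"Found: {steps[-1][2]}"


tools = {"search_issues": lambda q: f"2 open issues matching '{q}'"}

answer = run_agent(fake_model, tools, "table sorting")
print(answer)  # Found: 2 open issues matching 'table sorting'
```

The iteration cap matters: without it, a model that keeps requesting tools would loop forever, which is why the excerpt above passes max_iterations=5.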

Building the Bot

With the LLM choice, prompt design, and preset tools above in place, we can build a Q&A bot:

def agent_stream_chat(
    input_data: ChatData, 
    user_id: str,
    bot_id: str,
) -> AsyncIterator[str]:
    prompt = generate_prompt_by_user_id(user_id, bot_id)
    agent = AgentBuilder(
        chat_model=OpenAIClient(),
        prompt=prompt, tools=TOOL_MAPPING, enable_tavily=False
    )
    return dict_to_sse(
        agent.run_stream_chat(input_data)
    )

LangChain Integration

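As mentioned earlier, PeterCat's Q&A bots have many capabilities: when to create a bot, when to answer a user's question about a GitHub project's data, when to reply to an issue on the user's behalf, when to extract context, and so on. These decisions are all coordinated by LangChain.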

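LangChain's built-in AgentExecutor runs the loop in which the model decides, based on the user input, which tool to call. When enabled, the Tavily search tool is added to the tool list to provide web search, and ChatPromptTemplate formats the conversation context.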

import json
import logging
from typing import AsyncGenerator, AsyncIterator, Dict, Callable, Optional
from langchain.agents import AgentExecutor
from agent.llm import BaseLLMClient
from petercat_utils.data_class import ChatData, Message
from langchain.agents.format_scratchpad.openai_tools import (
    format_to_openai_tool_messages,
)
from langchain_core.messages import (
    AIMessage,
    FunctionMessage,
    HumanMessage,
    SystemMessage,
)
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser
from langchain.prompts import MessagesPlaceholder
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.utilities.tavily_search import TavilySearchAPIWrapper
from langchain_community.tools.tavily_search.tool import TavilySearchResults
from petercat_utils import get_env_variable


TAVILY_API_KEY = get_env_variable("TAVILY_API_KEY")

logger = logging.getLogger()


async def dict_to_sse(generator: AsyncGenerator[Dict, None]):
    ...

class AgentBuilder:
    agent_executor: AgentExecutor

    def __init__(
        self,
        chat_model: BaseLLMClient,
        prompt: str,
        tools: Dict[str, Callable],
        enable_tavily: Optional[bool] = True,
    ):
        """
        @class `Build AgentExecutor based on tools and prompt`
        @param prompt: str
        @param tools: Dict[str, Callable]
        @param enable_tavily: Optional[bool] If set True, enables the Tavily tool
        """
        self.prompt = prompt
        self.tools = tools
        self.enable_tavily = enable_tavily
        self.chat_model = chat_model
        self.agent_executor = self._create_agent_with_tools()

    def init_tavily_tools(self):
        # init Tavily
        search = TavilySearchAPIWrapper()
        tavily_tool = TavilySearchResults(api_wrapper=search)
        return [tavily_tool]

    def _create_agent_with_tools(self) -> AgentExecutor:
        ...

    def get_prompt(self):
        ...

    def chat_history_transform(self, messages: list[Message]):
        ...

    async def run_stream_chat(self, input_data: ChatData) -> AsyncIterator[Dict]:
        ...
    async def run_chat(self, input_data: ChatData) -> str:
        ...
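The body of dict_to_sse is elided above. For readers unfamiliar with Server-Sent Events, the sketch below shows one plausible shape for such a helper: each dict is serialized to JSON and framed as a `data:` line followed by a blank line. This is an assumption for illustration, not PeterCat's implementation; `to_sse_sketch` and `fake_events` are hypothetical names.

```python
import asyncio
import json
from typing import AsyncGenerator, AsyncIterator, Dict


async def to_sse_sketch(generator: AsyncGenerator[Dict, None]) -> AsyncIterator[str]:
    """Frame each dict as one SSE event: 'data: <json>' plus a blank line."""
    async for item in generator:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
        yield f"data: {json.dumps(item, ensure_ascii=False)}\n\n"


async def fake_events() -> AsyncGenerator[Dict, None]:
    """Stand-in for the agent's streamed output."""
    yield {"id": 1, "type": "message", "content": "Hello"}
    yield {"id": 2, "type": "message", "content": "你好"}


async def collect():
    return [frame async for frame in to_sse_sketch(fake_events())]


frames = asyncio.run(collect())
for frame in frames:
    print(frame, end="")
```

Because the helper is itself an async generator, a web framework can stream frames to the client as soon as the agent produces them instead of waiting for the full answer.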