Using LLM-Based Multi-Document Agents (MDAs) to Unlock the Collective Knowledge in Multiple Documents and Generate More Accurate Answers

In natural language processing and information retrieval, the emergence of Multi-Document Agents (MDAs) marks an important step forward.

These agents introduce a range of enhancements, including reranking during document retrieval and a sophisticated query planning tool, and they represent a paradigm shift for information retrieval systems. Traditionally, search engines and document retrieval systems have relied on single-document approaches, which limits their ability to deliver comprehensive, nuanced responses to complex queries.

MDAs, by contrast, draw on the collective knowledge embedded across multiple documents to generate more accurate and more insightful responses.

In this guide, we dive into the world of MDAs: what they are, what advantages they bring within LlamaIndex, and how to implement them in code.

Definitions

Before diving in, let's clarify some key concepts:

Multi-Document Agents (MDAs): intelligent systems capable of processing and synthesizing information from multiple documents to provide comprehensive responses to user queries.

LlamaIndex: a cutting-edge platform that facilitates document indexing and retrieval, serving as the backbone for building robust MDAs.

Advantages of Multi-Document Agents

  • Comprehensive information retrieval: by drawing on the collective knowledge of multiple documents, LlamaIndex-powered MDAs can provide more complete and accurate responses to queries, ensuring a richer user experience and promoting deeper understanding.
  • Enhanced relevance: the ability to rerank documents during retrieval lets MDAs prioritize the most relevant information, improving response quality so that users receive only what is most pertinent to their specific query (see the reranking sketch after this list).
  • Efficient query planning: a query planning tool lets MDAs optimize their search strategies, making retrieval more efficient and effective and ensuring timely, relevant responses that improve overall user satisfaction.
  • Scalability: LlamaIndex provides a robust infrastructure for building and deploying MDAs at scale. This scalability lets MDAs handle large volumes of data and support a wide variety of use cases, making them well suited to applications across many domains.
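
To make the reranking idea concrete, here is a minimal standalone sketch. It assumes LlamaIndex's CohereRerank postprocessor (installed in Step 1 below), a COHERE_API_KEY in the environment, and a hypothetical ./data directory of files; the query string is illustrative:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Hypothetical corpus directory; any folder of text files works
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve 10 candidate chunks, then let the reranker keep the 3 most relevant
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[CohereRerank(top_n=3)],
)
print(query_engine.query("What is a multi-document agent?"))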

Code Implementation

Now, let's walk through a basic implementation of MDAs using LlamaIndex:

Step 1: Install the libraries

%pip install llama-index-core
%pip install llama-index-agent-openai
%pip install llama-index-readers-file
%pip install llama-index-postprocessor-cohere-rerank
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-llms-anthropic
%pip install llama-index-embeddings-huggingface
%pip install unstructured[html]

Step 2: Set up and download the data

domain = "docs.llamaindex.ai"
docs_url = "https://docs.llamaindex.ai/en/latest/"
# Mirror the LlamaIndex docs site into ./docs.llamaindex.ai/
!wget -e robots=off --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {docs_url}

from llama_index.readers.file import UnstructuredReader
from pathlib import Path

reader = UnstructuredReader()

# Collect every downloaded file, then keep only the HTML pages
all_files_gen = Path("./docs.llamaindex.ai/").rglob("*")
all_files = [f.resolve() for f in all_files_gen]

all_html_files = [f for f in all_files if f.suffix.lower() == ".html"]
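
The later steps consume a docs list in which each Document carries its source file in metadata["path"] (Step 4's build_agents(docs) relies on it). A minimal loading sketch, assuming UnstructuredReader's load_data and a hypothetical doc_limit cap:

from llama_index.core import Document

doc_limit = 100  # assumption: cap the number of pages for a quick run

docs = []
for f in all_html_files[:doc_limit]:
    # Parse one HTML page into text chunks
    loaded_docs = reader.load_data(file=f, split_documents=True)
    # Merge the chunks into one Document, keeping the file path in metadata
    # because build_agents() reads doc.metadata["path"] later
    docs.append(
        Document(
            text="\n\n".join(d.get_content() for d in loaded_docs),
            metadata={"path": str(f)},
        )
    )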

Step 3: Define the global LLM and embedding model

import os
import nest_asyncio

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

os.environ["OPENAI_API_KEY"] = "sk-..."  # set your OpenAI API key

nest_asyncio.apply()  # allow nested event loops inside the notebook

# gpt-3.5-turbo as the default LLM; text-embedding-3-small for embeddings
llm = OpenAI(model="gpt-3.5-turbo")
Settings.llm = llm
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=256
)

Step 4: Build the document agents

from llama_index.agent.openai import OpenAIAgent
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.node_parser import SentenceSplitter
import os
from tqdm.notebook import tqdm
import pickle

async def build_agent_per_doc(nodes, file_base):
    print(file_base)

    vi_out_path = f"./data/llamaindex_docs/{file_base}"
    summary_out_path = f"./data/llamaindex_docs/{file_base}_summary.pkl"
    if not os.path.exists(vi_out_path):
        Path("./data/llamaindex_docs/").mkdir(parents=True, exist_ok=True)
        # Build the vector index
        vector_index = VectorStoreIndex(nodes)
        vector_index.storage_context.persist(persist_dir=vi_out_path)
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=vi_out_path),
        )

    # Build the summary index
    summary_index = SummaryIndex(nodes)

    # Define the query engines
    vector_query_engine = vector_index.as_query_engine(llm=llm)
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize", llm=llm
    )

    # Extract a short summary (cached to disk after the first run)
    if not os.path.exists(summary_out_path):
        Path(summary_out_path).parent.mkdir(parents=True, exist_ok=True)
        summary = str(
            await summary_query_engine.aquery(
                "Extract a concise 1-2 line summary of this document"
            )
        )
        with open(summary_out_path, "wb") as f:
            pickle.dump(summary, f)
    else:
        with open(summary_out_path, "rb") as f:
            summary = pickle.load(f)

    # Define the query engine tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name=f"vector_tool_{file_base}",
                description="Useful for questions related to specific facts",
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name=f"summary_tool_{file_base}",
                description="Useful for summarization questions",
            ),
        ),
    ]

    # Build the agent
    function_llm = OpenAI(model="gpt-4")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
        system_prompt=f"""\
You are a specialized agent designed to answer queries about the `{file_base}.html` part of the LlamaIndex docs.
You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
""",
    )

    return agent, summary

async def build_agents(docs):
    node_parser = SentenceSplitter()

    # Build a dictionary of document agents
    agents_dict = {}
    extra_info_dict = {}

    # # For baseline benchmarking
    # all_nodes = []

    for idx, doc in enumerate(tqdm(docs)):
        nodes = node_parser.get_nodes_from_documents([doc])
        # all_nodes.extend(nodes)

        # The ID combines the parent directory name and the file stem
        file_path = Path(doc.metadata["path"])
        file_base = str(file_path.parent.stem) + "_" + str(file_path.stem)
        agent, summary = await build_agent_per_doc(nodes, file_base)

        agents_dict[file_base] = agent
        extra_info_dict[file_base] = {"summary": summary, "nodes": nodes}

    return agents_dict, extra_info_dict

agents_dict, extra_info_dict = await build_agents(docs)
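
Before wiring these agents together, any single document agent can be queried on its own as a sanity check; the key and question below are illustrative:

# Illustrative: pick one document agent and chat with it directly
some_key = next(iter(agents_dict))
print(some_key)
print(agents_dict[some_key].chat("What does this page of the docs cover?"))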

Step 5: Build a retrieval-enabled OpenAI agent

# Define a wrapper tool for each document agent
all_tools = []
for file_base, agent in agents_dict.items():
    summary = extra_info_dict[file_base]["summary"]
    doc_tool = QueryEngineTool(
        query_engine=agent,
        metadata=ToolMetadata(
            name=f"tool_{file_base}",
            description=summary,
        ),
    )
    all_tools.append(doc_tool)

print(all_tools[0].metadata)

## Output
ToolMetadata(description='This document provides examples and documentation for an agent on the llama index platform.', name='tool_latest_index', fn_schema=...)

Step 6: Create the ObjectIndex

from llama_index.core import VectorStoreIndex
from llama_index.core.objects import (
    ObjectIndex,
    ObjectRetriever,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.schema import QueryBundle
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4-0613")

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)
vector_node_retriever = obj_index.as_node_retriever(
    similarity_top_k=10,
)
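
As a quick check, the ObjectIndex can also hand back tool objects directly; a sketch with an illustrative query:

# Illustrative: fetch the tools whose summaries best match a query
top_tools = obj_index.as_retriever(similarity_top_k=3).retrieve("agents")
print([t.metadata.name for t in top_tools])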

# Define a custom object retriever that adds a query planning tool
class CustomObjectRetriever(ObjectRetriever):
    def __init__(
        self,
        retriever,
        object_node_mapping,
        node_postprocessors=None,
        llm=None,
    ):
        self._retriever = retriever
        self._object_node_mapping = object_node_mapping
        self._llm = llm or OpenAI("gpt-4-0613")
        self._node_postprocessors = node_postprocessors or []

    def retrieve(self, query_bundle):
        if isinstance(query_bundle, str):
            query_bundle = QueryBundle(query_str=query_bundle)

        nodes = self._retriever.retrieve(query_bundle)
        for processor in self._node_postprocessors:
            nodes = processor.postprocess_nodes(
                nodes, query_bundle=query_bundle
            )
        tools = [self._object_node_mapping.from_node(n.node) for n in nodes]

        sub_question_engine = SubQuestionQueryEngine.from_defaults(
            query_engine_tools=tools, llm=self._llm
        )
        sub_question_description = """\
Useful for any queries that involve comparing multiple documents. ALWAYS use this tool for comparison queries - make sure to call this \
tool with the original query. Do NOT use the other tools for any queries involving multiple documents.
"""
        sub_question_tool = QueryEngineTool(
            query_engine=sub_question_engine,
            metadata=ToolMetadata(
                name="compare_tool", description=sub_question_description
            ),
        )

        return tools + [sub_question_tool]

# Wrap in an ObjectRetriever so that tool objects are returned
custom_obj_retriever = CustomObjectRetriever(
    vector_node_retriever,
    obj_index.object_node_mapping,
    node_postprocessors=[CohereRerank(top_n=5)],
    llm=llm,
)

from llama_index.agent.openai import OpenAIAgent
from llama_index.core.agent import ReActAgent  # alternative agent type (not used below)

top_agent = OpenAIAgent.from_tools(
    tool_retriever=custom_obj_retriever,
    system_prompt=""" \
You are an agent designed to answer queries about the documentation.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\
""",
    llm=llm,
    verbose=True,
)
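
To exercise the query-planning path added by the custom retriever, a comparison question should route through the compare_tool defined above; the question text is illustrative:

# Illustrative comparison query; verbose=True prints the tool calls made
response = top_agent.query(
    "Compare the ReActAgent and the OpenAIAgent in LlamaIndex."
)
print(str(response))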

Step 7: Define a baseline vector store index

# Pool every node from every document into one flat index as a baseline
all_nodes = [
    n for extra_info in extra_info_dict.values() for n in extra_info["nodes"]
]

base_index = VectorStoreIndex(all_nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=4)

Step 8: Compare the top-level agent against the baseline query engine

response = top_agent.query(
    "What types of agents are available in LlamaIndex?",
)

# Output

Added user message to memory: What types of agents are available in LlamaIndex?
=== Calling Function ===
Calling function: tool_agents_index with args: {"input":"types of agents"}
Added user message to memory: types of agents
=== Calling Function ===
Calling function: vector_tool_agents_index with args: {
  "input": "types of agents"
}
Got output: The types of agents mentioned in the provided context are ReActAgent, Native OpenAIAgent, OpenAIAgent with Query Engine Tools, OpenAIAgent Query Planning, OpenAI Assistant, OpenAI Assistant Cookbook, Forced Function Calling, Parallel Function Calling, and Context Retrieval.
========================

Got output: The types of agents mentioned in the `agents_index.html` part of the LlamaIndex docs are:

1. ReActAgent
2. Native OpenAIAgent
3. OpenAIAgent with Query Engine Tools
4. OpenAIAgent Query Planning
5. OpenAI Assistant
6. OpenAI Assistant Cookbook
7. Forced Function Calling
8. Parallel Function Calling
9. Context Retrieval
========================
# Baseline
response = base_query_engine.query(
    "What types of agents are available in LlamaIndex?",
)
print(str(response))

# Output

The types of agents available in LlamaIndex are ReActAgent, Native OpenAIAgent, and OpenAIAgent.

Note how the top-level agent surfaces a far more complete list than the baseline query engine, which only sees the top-4 retrieved chunks.

Conclusion

Multi-Document Agents (MDAs) represent a major step forward in information retrieval technology, especially when integrated with a platform as capable as LlamaIndex.

By drawing on the collective knowledge embedded across multiple documents, MDAs make information retrieval more comprehensive, more relevant, and more efficient.

With the introduction of V1 MDAs and enhancements such as reranking and query planning, the potential applications are nearly limitless. As this technology continues to be explored and refined, we can expect still greater advances in natural language processing and information retrieval.

In short, MDAs combined with LlamaIndex promise to transform how we interact with and extract insight from vast bodies of information, ushering in a new era of intelligent information retrieval systems.
