Multi-Document Retrieval with RAG (Retrieval-Augmented Generation)

When retrieving across many documents, a RAG (Retrieval-Augmented Generation) setup helps us pull relevant information from different documents more effectively. This article walks through implementing such multi-document retrieval with LlamaIndex and Weaviate.

Installing Dependencies

First, install the required libraries:

%pip install llama-index-readers-github
%pip install llama-index-vector-stores-weaviate
!pip install llama-index llama-hub

Downloading and Setting Up the Data

In this section, we load GitHub issue data from the LlamaIndex repository.

import nest_asyncio
import os
from llama_index.readers.github import GitHubRepositoryIssuesReader, GitHubIssuesClient

nest_asyncio.apply()

os.environ["GITHUB_TOKEN"] = "ghp_..."
os.environ["OPENAI_API_KEY"] = "sk-..."

github_client = GitHubIssuesClient()
loader = GitHubRepositoryIssuesReader(github_client, owner="run-llama", repo="llama_index", verbose=True)
orig_docs = loader.load_data()

# Limit to 100 documents
limit = 100
docs = []
for idx, doc in enumerate(orig_docs):
    if idx >= limit:
        break
    # Tag each kept document with its issue number for later metadata filtering
    doc.metadata["index_id"] = int(doc.id_)
    docs.append(doc)
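The truncation loop above can also be expressed with itertools.islice. A minimal self-contained sketch, with plain dicts standing in for the loaded GitHub issue documents (the dict shape here is only a stand-in, not the real Document class):

```python
from itertools import islice

# Dummy stand-ins for the loaded GitHub issue documents (hypothetical shape)
orig_docs = [{"id_": str(i), "metadata": {}} for i in range(250)]

limit = 100
docs = []
for doc in islice(orig_docs, limit):
    # Tag each kept document with its issue number, as in the loop above
    doc["metadata"]["index_id"] = int(doc["id_"])
    docs.append(doc)

print(len(docs))  # 100
```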

Setting Up the Vector Store and Index

import weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

auth_config = weaviate.AuthApiKey(api_key="XRa15cDIkYRT7AkrpqT6jLfE4wropK1c1TGk")
client = weaviate.Client("https://llama-index-test-v0oggsoz.weaviate.network", auth_client_secret=auth_config)
class_name = "LlamaIndex_docs"

client.schema.delete_class(class_name)  # Optional: delete an existing schema of the same name

vector_store = WeaviateVectorStore(weaviate_client=client, index_name=class_name)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
doc_index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

Creating IndexNodes for Retrieval and Filtering

from llama_index.core import SummaryIndex
from llama_index.core.async_utils import run_jobs
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import IndexNode
from llama_index.core.vector_stores import FilterOperator, MetadataFilter, MetadataFilters

async def aprocess_doc(doc, include_summary: bool = True):
    metadata = doc.metadata
    date_tokens = metadata["created_at"].split("T")[0].split("-")
    year, month, day = int(date_tokens[0]), int(date_tokens[1]), int(date_tokens[2])
    assignee = "" if "assignee" not in doc.metadata else doc.metadata["assignee"]
    size = ""
    if len(doc.metadata["labels"]) > 0:
        size_arr = [l for l in doc.metadata["labels"] if "size:" in l]
        size = size_arr[0].split(":")[1] if len(size_arr) > 0 else ""
    new_metadata = {"state": metadata["state"], "year": year, "month": month, "day": day, "assignee": assignee, "size": size}

    summary_index = SummaryIndex.from_documents([doc])
    query_str = "Give a one-sentence concise summary of this issue."
    query_engine = summary_index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo"))
    summary_txt = await query_engine.aquery(query_str)
    summary_txt = str(summary_txt)

    index_id = doc.metadata["index_id"]
    filters = MetadataFilters(filters=[MetadataFilter(key="index_id", operator=FilterOperator.EQ, value=int(index_id))])

    index_node = IndexNode(text=summary_txt, metadata=new_metadata, obj=doc_index.as_retriever(filters=filters), index_id=doc.id_)
    return index_node

async def aprocess_docs(docs):
    tasks = [aprocess_doc(doc) for doc in docs]
    index_nodes = await run_jobs(tasks, show_progress=True, workers=3)
    return index_nodes

index_nodes = await aprocess_docs(docs)
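The metadata parsing inside aprocess_doc (date split, assignee fallback, size label) can be exercised in isolation, without any API calls. A sketch using a hypothetical helper, parse_issue_metadata, that mirrors that logic on a plain dict:

```python
def parse_issue_metadata(metadata: dict) -> dict:
    # Mirrors aprocess_doc: split the ISO timestamp into year/month/day
    year, month, day = (int(t) for t in metadata["created_at"].split("T")[0].split("-"))
    # Fall back to an empty string when no assignee is present
    assignee = metadata.get("assignee", "")
    # Extract the value of a "size:<bucket>" label, if any
    size_labels = [l for l in metadata.get("labels", []) if "size:" in l]
    size = size_labels[0].split(":")[1] if size_labels else ""
    return {
        "state": metadata["state"],
        "year": year,
        "month": month,
        "day": day,
        "assignee": assignee,
        "size": size,
    }

example = {
    "state": "open",
    "created_at": "2024-01-11T09:30:00Z",
    "labels": ["bug", "size:XL"],
}
print(parse_issue_metadata(example))
```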

Creating the Top-Level AutoRetriever

With the summary metadata and the original documents loaded into the vector database, we can run a structured, hierarchical retrieval strategy.

import weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

auth_config = weaviate.AuthApiKey(api_key="XRa15cDIkYRT7AkrpqT6jLfE4wropK1c1TGk")
client = weaviate.Client("https://llama-index-test-v0oggsoz.weaviate.network", auth_client_secret=auth_config)
class_name = "LlamaIndex_auto"

client.schema.delete_class(class_name)  # Optional: delete an existing schema of the same name

vector_store_auto = WeaviateVectorStore(weaviate_client=client, index_name=class_name)
storage_context_auto = StorageContext.from_defaults(vector_store=vector_store_auto)

index = VectorStoreIndex(objects=index_nodes, storage_context=storage_context_auto)

Setting Up a Composable Auto-Retriever

from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="Github Issues",
    metadata_info=[
        MetadataInfo(name="state", description="Whether the issue is `open` or `closed`", type="string"),
        MetadataInfo(name="year", description="The year issue was created", type="integer"),
        MetadataInfo(name="month", description="The month issue was created", type="integer"),
        MetadataInfo(name="day", description="The day issue was created", type="integer"),
        MetadataInfo(name="assignee", description="The assignee of the ticket", type="string"),
        MetadataInfo(name="size", description="How big the issue is (XS, S, M, L, XL, XXL)", type="string"),
    ],
)
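VectorIndexAutoRetriever prompts the LLM to translate a natural-language query into a semantic query string plus structured filters over the fields declared in vector_store_info. The dict below only illustrates that idea; it is a hypothetical shape for exposition, not the library's internal format:

```python
# Hypothetical illustration of what an inferred query spec for
# "open XL issues from January" might conceptually contain
inferred_spec = {
    "query_str": "issues",
    "filters": [
        {"key": "state", "operator": "==", "value": "open"},
        {"key": "size", "operator": "==", "value": "XL"},
        {"key": "month", "operator": "==", "value": 1},
    ],
}
print(inferred_spec["filters"])
```

The retriever then runs the semantic query against the summary index while applying the inferred metadata filters.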

from llama_index.core.retrievers import VectorIndexAutoRetriever

retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    similarity_top_k=2,
    empty_query_top_k=10,
    verbose=True,
)

Trying Out Retrieval

from llama_index.core import QueryBundle

nodes = retriever.retrieve(QueryBundle("Tell me about some issues on 01/11"))
print(f"Number of source nodes: {len(nodes)}")
print(nodes[0].node.metadata)

Integrating with RetrieverQueryEngine

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

response = query_engine.query("Tell me about some issues on 01/11")
print(str(response))

Summary

This article showed how a structured retrieval layer over document summaries can surface relevant information and dynamically select the relevant documents for a user query. The approach applies not only to RAG but also to multi-document agent setups.

Possible Errors

  1. Permission errors: make sure your GitHub token and OpenAI API key are set correctly.
  2. Network errors: an unstable network can cause timeouts or connection failures when calling the APIs.
  3. Data loading errors: if the GitHub repository does not contain enough issue data, document loading may fail.
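For the first class of error, a quick pre-flight check of the credentials fails fast before any API call is made. A minimal sketch (check_credentials is a hypothetical helper, not part of LlamaIndex):

```python
import os

def check_credentials(required=("GITHUB_TOKEN", "OPENAI_API_KEY")):
    # Fail fast if any required credential is missing or empty
    missing = [key for key in required if not os.environ.get(key)]
    if missing:
        raise EnvironmentError(f"Missing credentials: {', '.join(missing)}")
    return True
```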

If you found this article helpful, please like and follow my blog. Thanks!
