Building a Live RAG Pipeline over Google Drive Files

Latest recommended article published on 2024-07-31 11:41:46

qq_29929123

Article tags: bootstrap, front end, html, python

Article link: https://blog.csdn.net/qq_29929123/article/details/139798212


In this guide, we show how to build a "live" RAG (Retrieval-Augmented Generation) pipeline over Google Drive files.

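The pipeline indexes Google Drive files and loads them into a Redis vector store. After that, each time the ingestion pipeline is rerun, it propagates incremental updates, so only documents that have changed are updated in the vector store. This means we do not need to re-index every document!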
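
To make the idea of incremental updates concrete, here is a minimal sketch in plain Python. It is not LlamaIndex's actual implementation: it assumes every document has a stable ID and that a hash of its text is enough to detect a change. `ToyDocstore` and `upsert` are hypothetical names used only for illustration.

```python
import hashlib


class ToyDocstore:
    """Hypothetical in-memory stand-in for a document store (illustration only)."""

    def __init__(self):
        self.hashes = {}  # doc_id -> content hash recorded at the last ingestion

    def upsert(self, docs):
        """Return the IDs of documents that are new or whose text has changed."""
        to_process = []
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.hashes.get(doc_id) != digest:
                self.hashes[doc_id] = digest
                to_process.append(doc_id)
        return to_process


store = ToyDocstore()
first_run = store.upsert({"a.txt": "hello", "b.txt": "world"})
second_run = store.upsert({"a.txt": "hello", "b.txt": "world, edited"})
print(first_run)   # both documents are new on the first run
print(second_run)  # only the edited document is picked up on the second run
```

Unchanged documents are skipped on the second run, which is the behavior that makes rerunning the pipeline cheap.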

Setup

We install the required packages and start the Redis Docker image.

%pip install llama-index-storage-docstore-redis
%pip install llama-index-vector-stores-redis
%pip install llama-index-embeddings-huggingface
%pip install llama-index-readers-google

# If you are creating a new container
!docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
# If you are starting an existing container
# !docker start -a redis-stack
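
Before going further, it can help to confirm that the container is actually accepting connections. The check below is a small sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and the host and port are assumed to match the `docker run` command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


print("Redis reachable:", port_is_open("localhost", 6379))
```

This only verifies that something is listening on the port; it does not confirm that the service is Redis or that it is healthy.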

Define the Ingestion Pipeline

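Here we define the ingestion pipeline. Given a set of documents, it runs the sentence-splitting and embedding transformations, then loads the results into the Redis document store and vector store.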

import os

# The query engine uses an OpenAI LLM by default, so an API key is required
os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.ingestion import (
    DocstoreStrategy,
    IngestionPipeline,
    IngestionCache,
)
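# In llama-index 0.10+, the Redis cache backend lives in the kvstore integration package
from llama_index.storage.kvstore.redis import RedisKVStore as RedisCache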
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.redis import RedisVectorStore

vector_store = RedisVectorStore(
    index_name="redis_vector_store",
    index_prefix="vectore_store",
    redis_url="redis://localhost:6379",
)

cache = IngestionCache(
    cache=RedisCache.from_host_and_port("localhost", 6379),
    collection="redis_cache",
)

# Optional: clear the vector store if it already exists
if vector_store._index_exists():
    vector_store.delete_index()

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="document_store"
    ),
    vector_store=vector_store,
    cache=cache,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
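
The role of the cache can be pictured with a small sketch. This is a toy illustration under stated assumptions, not LlamaIndex's implementation: it assumes results are keyed by a hash of the transformation name plus the input text, and `cached_transform` and `fake_embed` are hypothetical.

```python
import hashlib

result_cache = {}  # key -> previously computed result
calls = []         # records every time the expensive function really runs


def fake_embed(text):
    """Stand-in for an expensive embedding call (hypothetical)."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


def cached_transform(name, fn, text):
    """Run fn(text) at most once per (transformation name, text) pair."""
    key = hashlib.sha256(f"{name}:{text}".encode("utf-8")).hexdigest()
    if key not in result_cache:
        result_cache[key] = fn(text)
    return result_cache[key]


v1 = cached_transform("embed", fake_embed, "live RAG pipeline")
v2 = cached_transform("embed", fake_embed, "live RAG pipeline")  # served from the cache
print(v1 == v2, len(calls))
```

The second call returns the stored result without invoking `fake_embed` again, which is the saving a cache offers when the same text passes through the same transformation twice.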

Define the Vector Store Index

We define an index that wraps the underlying vector store.

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

Load Initial Data

Here we load data using the Google Drive loader from LlamaHub.

from llama_index.readers.google import GoogleDriveReader

loader = GoogleDriveReader()

def load_data(folder_id: str):
    docs = loader.load_data(folder_id=folder_id)
    for doc in docs:
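        # Use the file name as a stable document ID so that reruns can match
        # each file against its previously ingested version
        doc.id_ = doc.metadata["file_name"]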
    return docs

docs = load_data(folder_id="1RFhr3-KmOZCR5rtp4dlOMNl3LKe1kOA5")

nodes = pipeline.run(documents=docs)
print(f"Ingested {len(nodes)} Nodes")

Query the Initial Data

query_engine = index.as_query_engine()

response = query_engine.query("What are the sub-types of question answering?")

print(str(response))
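
The retrieval step of such a query comes down to embedding the question and finding the stored vectors closest to it. Below is a minimal sketch using cosine similarity over hand-written 3-dimensional vectors. The node names and numbers are made up for illustration, and a real store such as Redis uses its own indexing and search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, store, k=2):
    """Return the k node IDs whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda node_id: cosine(query_vec, store[node_id]),
        reverse=True,
    )
    return ranked[:k]


toy_store = {
    "node-qa": [0.9, 0.1, 0.0],
    "node-summarization": [0.1, 0.9, 0.0],
    "node-translation": [0.0, 0.2, 0.9],
}
result = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(result)
```

The retrieved nodes are then passed to the LLM as context for generating the final answer.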

Modify and Reload the Data

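Let's try modifying the data we have already ingested! Edit one of the files in the Google Drive folder, then reload the folder and rerun the pipeline; only the changed documents should be re-processed.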

docs = load_data(folder_id="1RFhr3-KmOZCR5rtp4dlOMNl3LKe1kOA5")
nodes = pipeline.run(documents=docs)
print(f"Ingested {len(nodes)} Nodes")
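
To keep the index "live", the reload step has to run again whenever files may have changed. One simple option, shown here as a sketch and an assumption rather than something the pipeline provides, is to call the ingestion job on a fixed interval. `run_periodically` and `fake_ingest` are hypothetical names; in practice the job would wrap `load_data` and `pipeline.run`.

```python
import time


def run_periodically(job, interval_seconds, max_runs):
    """Call job() max_runs times, sleeping interval_seconds between calls."""
    results = []
    for i in range(max_runs):
        results.append(job())
        if i < max_runs - 1:
            time.sleep(interval_seconds)
    return results


def fake_ingest():
    """Placeholder for: docs = load_data(...); return len(pipeline.run(documents=docs))."""
    fake_ingest.runs += 1
    return fake_ingest.runs


fake_ingest.runs = 0
print(run_periodically(fake_ingest, interval_seconds=0.01, max_runs=3))
```

A scheduler such as cron would serve the same purpose; the fixed loop is used here only to keep the example self-contained.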

Query the New Data

query_engine = index.as_query_engine()

response = query_engine.query("What are the sub-types of question answering?")

print(str(response))

Possible Errors

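Network connection errors: make sure the network is working and that both the Redis server and the Google Drive API are reachable.
Invalid API key: make sure a valid API key is set in the environment variables and that the key has the correct permissions.
File permission errors: make sure the permissions on the Google Drive files and folders allow the API to read them.
Dependency version conflicts: installing the packages may run into version conflicts; using a virtual environment to manage dependencies is recommended.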
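
For transient failures such as network errors, a common mitigation is to retry with exponential backoff. The helper below is a generic sketch, not part of LlamaIndex or the Google Drive reader. `retry` and `flaky_load` are hypothetical, and the choice of which exception types to retry is an assumption you should adapt to the errors you actually see.

```python
import time


def retry(fn, attempts=3, base_delay=0.5, exceptions=(ConnectionError, TimeoutError)):
    """Call fn(); on a listed exception, wait base_delay * 2**n and try again."""
    for n in range(attempts):
        try:
            return fn()
        except exceptions:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** n)


state = {"calls": 0}


def flaky_load():
    """Fails twice with a simulated network error, then succeeds."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network failure")
    return "loaded"


outcome = retry(flaky_load, attempts=3, base_delay=0.01)
print(outcome, state["calls"])
```

Errors that are not transient, such as an invalid API key or missing file permissions, should not be retried; they need to be fixed at the source.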
