Self-Querying Vector Retrieval with MongoDB Atlas and OpenAI

Latest recommended article published on 2025-04-04 20:39:13

qahaj

Article tags: mongodb python linear algebra

Article link: https://blog.csdn.net/qahaj/article/details/145767590

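In this article, we look at how to build a vector store with MongoDB Atlas and combine it with OpenAI to implement self-querying retrieval. By creating a MongoDB Atlas VectorStore, we can store and retrieve vector representations of documents. This is useful in many scenarios, such as retrieving movie summaries. The implementation steps follow.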

Technical Background

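MongoDB Atlas is a managed document database service with vector search support, so embeddings can be stored and queried alongside the documents they describe. Combined with OpenAI's embedding service, this enables semantic document retrieval. LangChain provides a flexible API, which we use to build a self-query retriever (SelfQueryRetriever).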
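
Vector search in Atlas depends on a search index defined on the collection. The sketch below builds one plausible index definition as a Python dictionary and prints it as JSON. It rests on assumptions that are not stated in this article: that the vectors are stored under a field named "embedding", that the embedding model returns 1536-dimensional vectors, and that the movie metadata fields used later in this article (genre, year, rating) are stored at the top level of each document. The helper name build_vector_index_definition is hypothetical. Adjust all of these to match your own collection.

```python
import json

# Assumptions (not confirmed by this article): vectors live under "embedding",
# the embedding model returns 1536-dimensional vectors, and the metadata
# fields are top-level keys of each stored document.
EMBEDDING_FIELD = "embedding"
EMBEDDING_DIMENSIONS = 1536
FILTER_FIELDS = ["genre", "year", "rating"]


def build_vector_index_definition(embedding_field, dimensions, filter_fields):
    """Hypothetical helper: assemble an Atlas Vector Search index definition."""
    # One "vector" entry describes the embedding field itself.
    fields = [
        {
            "type": "vector",
            "path": embedding_field,
            "numDimensions": dimensions,
            "similarity": "cosine",
        }
    ]
    # One "filter" entry per metadata field that queries should filter on.
    fields += [{"type": "filter", "path": name} for name in filter_fields]
    return {"fields": fields}


index_definition = build_vector_index_definition(
    EMBEDDING_FIELD, EMBEDDING_DIMENSIONS, FILTER_FIELDS
)
print(json.dumps(index_definition, indent=2))
```

The printed JSON can be used when creating the vector search index on the collection. Fields that are not declared as filter fields generally cannot be used in metadata filters, which is why the three metadata fields are listed here.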

Core Principles

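Building a vector store converts each document into an embedding vector that can be compared by similarity. On top of that, a self-query retriever uses an LLM to turn the user's natural-language question into a semantic query plus optional metadata filters, so relevant documents can be retrieved efficiently even under compound filter conditions.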
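
To make the idea concrete, here is a minimal pure-Python sketch of filtered similarity search. It is not how Atlas or LangChain implement it. The three-dimensional vectors, the sample entries, and the function names are all invented for illustration; real embeddings have far more dimensions and come from an embedding model.

```python
import math

# Invented toy data: 3-dimensional stand-ins for real embedding vectors.
MOVIES = [
    {"text": "Scientists bring back dinosaurs", "vector": [0.9, 0.1, 0.0], "year": 1993},
    {"text": "A dream within a dream", "vector": [0.1, 0.9, 0.2], "year": 2010},
    {"text": "Toys come alive", "vector": [0.2, 0.1, 0.9], "year": 1995},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, min_year=None, k=2):
    """Apply the metadata filter first, then rank survivors by similarity."""
    candidates = [m for m in MOVIES if min_year is None or m["year"] >= min_year]
    ranked = sorted(
        candidates,
        key=lambda m: cosine_similarity(query_vector, m["vector"]),
        reverse=True,
    )
    return [m["text"] for m in ranked[:k]]


# A query vector that points in the same direction as the dinosaur entry.
dinosaur_query = [1.0, 0.0, 0.0]
print(search(dinosaur_query, k=1))                 # best match overall
print(search(dinosaur_query, min_year=2000, k=1))  # best match among newer films
```

The second call shows why filters matter: the dinosaur entry is the closest vector, but the year filter removes it before ranking, so a different document is returned.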

Code Walkthrough

The following code shows how to create a vector store with MongoDB Atlas and combine it with OpenAI for self-querying retrieval:

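# Make sure the required libraries are installed (lark is needed by the query parser)
%pip install --upgrade --quiet lark pymongo langchain langchain-community langchain-openai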

import os

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_core.documents import Document
from langchain_openai import OpenAI, OpenAIEmbeddings
from pymongo import MongoClient

# Configure the OpenAI API key (placeholder; avoid committing a real key to source control)
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Configure the MongoDB Atlas connection (replace the placeholders with your own values)
CONNECTION_STRING = "your-mongodb-atlas-connection-string"
DB_NAME = "your-database"
COLLECTION_NAME = "your-collection"
INDEX_NAME = "your-index"

# Create the MongoDB client and select the target collection
client = MongoClient(CONNECTION_STRING)
collection = client[DB_NAME][COLLECTION_NAME]

# Use OpenAI to create the embeddings
embeddings = OpenAIEmbeddings()

# Sample documents
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "action"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "genre": "thriller", "rating": 8.2},
    ),
    # More sample documents...
]

# Create the vector store (INDEX_NAME must match a vector search index on this collection)
vectorstore = MongoDBAtlasVectorSearch.from_documents(docs, embeddings, collection=collection, index_name=INDEX_NAME)

# Describe the metadata fields so the LLM knows what it can filter on
metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]

# Description of the document content
document_content_description = "Brief summary of a movie"

# Create the self-query retriever
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)

# Test the retriever
retriever.invoke("What are some movies about dinosaurs")
retriever.invoke("What are some highly rated movies (above 9)?")
retriever.invoke("I want to watch a movie about toys rated higher than 9")
retriever.invoke("What's a highly rated (above or equal 9) thriller film?")
retriever.invoke("What's a movie after 1990 but before 2005 that's all about dinosaurs, and preferably has a lot of action")

# Enable limit so the retriever can honor a result count stated in the query
retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True, enable_limit=True)
retriever.invoke("What are two movies about dinosaurs?")
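
For intuition about what happens to the filtered queries above, the following sketch shows how a parsed comparison such as "rating greater than 9" could be expressed with MongoDB's query operators. This is a simplified illustration, not LangChain's actual translator: the helper names comparison and all_of and the comparator keys are hypothetical, and only the MongoDB operators themselves ($eq, $gt, $gte, $lt, $lte, $and) are real.

```python
# Hypothetical mini-translator for illustration only; LangChain ships its own
# query translator, which this sketch does not reproduce.
COMPARATOR_TO_MONGO = {
    "eq": "$eq",
    "gt": "$gt",
    "gte": "$gte",
    "lt": "$lt",
    "lte": "$lte",
}


def comparison(comparator, attribute, value):
    """Express one comparison on a metadata field as a MongoDB filter."""
    return {attribute: {COMPARATOR_TO_MONGO[comparator]: value}}


def all_of(*conditions):
    """Combine several filters so that every one of them must match."""
    return {"$and": list(conditions)}


# "movies rated higher than 9"
rating_filter = comparison("gt", "rating", 9)

# "a movie after 1990 but before 2005"
year_range_filter = all_of(
    comparison("gt", "year", 1990),
    comparison("lt", "year", 2005),
)

print(rating_filter)
print(year_range_filter)
```

With only the two sample documents shown earlier (ratings 7.7 and 8.2), a filter like rating_filter would match nothing, so the "above 9" queries should come back empty unless more documents are added.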