Vector Search and a Recommendation System for Movie Data with MongoDB

Latest recommended article published on 2024-08-06 17:28:55

ppoojjj

Tags: mongodb, database, python

Article link: https://blog.csdn.net/ppoojjj/article/details/140365582

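This article shows how to build vector search and a recommendation system over movie data with MongoDB. We use an OpenAI embedding model together with the LlamaIndex library. The steps are data preprocessing, embedding generation, database setup, and querying.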

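Environment Setup and Installing Dependencies

First, make sure the required libraries are installed. The libraries and their install commands are: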

!pip install llama-index
!pip install llama-index-vector-stores-mongodb
!pip install llama-index-embeddings-openai
!pip install pymongo
!pip install datasets
!pip install pandas
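
Before moving on, it can help to confirm that the installs above actually landed in the active environment. The following is a minimal sketch using only the standard library; `find_missing` and `REQUIRED` are hypothetical names introduced here for illustration, and the list simply mirrors the pip commands above.

```python
from importlib import metadata

# Distribution names copied from the pip commands above.
REQUIRED = [
    "llama-index",
    "llama-index-vector-stores-mongodb",
    "llama-index-embeddings-openai",
    "pymongo",
    "datasets",
    "pandas",
]


def find_missing(packages):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in packages:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Missing packages:", find_missing(REQUIRED) or "none")
```

If the printed list is not empty, rerun the corresponding install command before continuing.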

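Loading the Dataset from Hugging Face

We will use a movie dataset that already contains embeddings and convert it into a Pandas DataFrame for further processing.

# Placeholder: replace the value with your own OpenAI API key
%env OPENAI_API_KEY=OPENAI_API_KEY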

from datasets import load_dataset
import pandas as pd

# Load the dataset
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert to a DataFrame
dataset_df = pd.DataFrame(dataset["train"])

# Show the first 5 rows
dataset_df.head(5)
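
The `%env` line in the cell above sets `OPENAI_API_KEY` to a literal placeholder, which is easy to forget to replace. As a rough sketch, a small guard like the one below could catch that early. `require_env` is a hypothetical helper, not part of any library used in this article, and treating a value equal to the variable name as a placeholder is an assumption made for this example.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it looks unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set")
    # Assumption: a value identical to the variable name is an unreplaced placeholder.
    if value == name:
        raise RuntimeError(f"{name} still holds the placeholder value")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before creating the embedding model would then fail fast with a clear message instead of an authentication error later on.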

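Data Preprocessing

In this step we clean the data and remove the existing embeddings so that they can be regenerated with a new embedding model.

# Drop rows with a missing fullplot value
dataset_df = dataset_df.dropna(subset=["fullplot"])
print("\nNumber of missing values in each column after removal:")
print(dataset_df.isnull().sum())

# Drop the old embeddings
dataset_df = dataset_df.drop(columns=["plot_embedding"])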

dataset_df.head(5)

Embedding Model and LlamaIndex Configuration

Use OpenAI's embedding model and configure the corresponding LlamaIndex settings.

from llama_index.core.settings import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=256)
llm = OpenAI()

Settings.llm = llm
Settings.embed_model = embed_model

Converting the Data to Document Format

Convert the data in the DataFrame into a document format that LlamaIndex can process, and generate embeddings.

import json
from llama_index.core import Document
from llama_index.core.schema import MetadataMode

# Convert the DataFrame to a JSON string
documents_json = dataset_df.to_json(orient="records")
# Load the JSON string into a Python list
documents_list = json.loads(documents_json)

llama_documents = []

for document in documents_list:
    document["writers"] = json.dumps(document["writers"])
    document["languages"] = json.dumps(document["languages"])
    document["genres"] = json.dumps(document["genres"])
    document["cast"] = json.dumps(document["cast"])
    document["directors"] = json.dumps(document["directors"])
    document["countries"] = json.dumps(document["countries"])
    document["imdb"] = json.dumps(document["imdb"])
    document["awards"] = json.dumps(document["awards"])

    llama_document = Document(
        text=document["fullplot"],
        metadata=document,
        excluded_llm_metadata_keys=["fullplot", "metacritic"],
        excluded_embed_metadata_keys=[
            "fullplot",
            "metacritic",
            "poster",
            "num_mflix_comments",
            "runtime",
            "rated",
        ],
        metadata_template="{key}=>{value}",
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )

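    llama_documents.append(llama_document)

from llama_index.core.node_parser import SentenceSplitter

# `nodes` is used later but was never defined in the original listing.
# A minimal way to build it, assuming the default SentenceSplitter settings:
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(llama_documents)

# Embed each node; MetadataMode.EMBED honors excluded_embed_metadata_keys
for node in nodes:
    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode=MetadataMode.EMBED)
    )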
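
To make the effect of `metadata_template`, `text_template`, and the excluded-key lists more concrete, here is a plain-Python approximation of how the text for one document might be assembled. It is a sketch, not LlamaIndex's actual implementation: `render_for_embedding` is a hypothetical function, the newline separator between metadata entries is an assumption, and the sample record is invented for illustration.

```python
def render_for_embedding(
    metadata,
    content,
    excluded_keys,
    metadata_template="{key}=>{value}",
    text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
):
    """Approximate the text an embedding model would see for one document."""
    visible = [
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
        if key not in excluded_keys
    ]
    # Assumption: metadata entries are joined with newlines.
    metadata_str = "\n".join(visible)
    return text_template.format(metadata_str=metadata_str, content=content)


# Invented sample record; list fields are JSON strings, as in the loop above.
sample = {
    "title": "Example Movie",
    "genres": '["Romance"]',
    "runtime": 95,
    "fullplot": "A long plot.",
}
text = render_for_embedding(
    sample, sample["fullplot"], excluded_keys={"fullplot", "runtime"}
)
print(text)
```

Excluding `fullplot` from the metadata avoids repeating the plot twice in the embedded text, since it already appears as the content.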

Connecting to MongoDB Atlas

Make sure your MongoDB cluster is set up and that you have its connection string. The following code shows how to connect to MongoDB Atlas.

import pymongo
from google.colab import userdata  # Colab-only; read the URI another way outside Colab

def get_mongo_client(mongo_uri):
    try:
        client = pymongo.MongoClient(mongo_uri)
        # MongoClient connects lazily, so ping the server to confirm the connection
        client.admin.command("ping")
        print("Connection to MongoDB successful")
        return client
    except pymongo.errors.ConnectionFailure as e:
        print(f"Connection failed: {e}")
        return None

mongo_uri = userdata.get("MONGO_URI")
if not mongo_uri:
    print("MONGO_URI not set in Colab secrets")

mongo_client = get_mongo_client(mongo_uri)

DB_NAME = "movies"
COLLECTION_NAME = "movies_records"

db = mongo_client[DB_NAME]
collection = db[COLLECTION_NAME]

# Delete any existing records in the collection
collection.delete_many({})

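Creating the Vector Search Index and Querying

Load the nodes into MongoDB through a vector store backed by a vector search index, then run a query against it.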

from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

vector_store = MongoDBAtlasVectorSearch(
    mongo_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    index_name="vector_index",
)
vector_store.add(nodes)
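
The code above refers to an index named `vector_index` but does not show its definition. Below is a sketch of what such a definition might look like as JSON for the Atlas index editor. The values are assumptions: `numDimensions` is chosen to match `dimensions=256` from the embedding model configured earlier, `embedding` is assumed to be the field the vector store writes vectors to, and `cosine` is one possible similarity setting. `vector_index_definition` is a hypothetical helper.

```python
import json


def vector_index_definition(num_dimensions=256, path="embedding", similarity="cosine"):
    """Build an Atlas Vector Search index definition as a plain dict."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,  # assumed name of the field holding the vectors
                "numDimensions": num_dimensions,  # must match the embedding size
                "similarity": similarity,
            }
        ]
    }


print(json.dumps(vector_index_definition(), indent=2))
```

If the embedding model's `dimensions` value changes, the index definition has to change with it, otherwise queries against the index will not work as expected.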

from llama_index.core import VectorStoreIndex, StorageContext

index = VectorStoreIndex.from_vector_store(vector_store)

import pprint
from llama_index.core.response.notebook_utils import display_response

query_engine = index.as_query_engine(similarity_top_k=3)

query = "Recommend a romantic movie suitable for the Christmas season and justify your selection"

response = query_engine.query(query)
display_response(response)
pprint.pprint(response.source_nodes)
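
Conceptually, `similarity_top_k=3` asks for the three stored vectors closest to the query embedding. The brute-force sketch below illustrates that idea with cosine similarity on toy two-dimensional vectors. It is only an illustration: a real vector index typically uses approximate search rather than scoring every vector, and `cosine_similarity`, `top_k`, and the toy data are all invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, candidates, k=3):
    """Return the k (name, score) pairs most similar to query_vec, best first."""
    scored = [
        (name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for movie embeddings.
toy = {
    "movie_a": [1.0, 0.0],
    "movie_b": [0.7, 0.7],
    "movie_c": [0.0, 1.0],
    "movie_d": [-1.0, 0.0],
}
results = top_k([1.0, 0.1], toy, k=3)
print([name for name, _ in results])
```

Here `movie_a` ranks first because its direction is closest to the query vector, and `movie_d`, which points the opposite way, is left out of the top three.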

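Possible Errors

Connection failure: If connecting to MongoDB fails, check that your MongoDB URI is correct and that the cluster is configured to allow connections from your IP address.
Missing data: Missing values can cause some preprocessing operations to fail. Clean the data appropriately before any analysis.
Index creation failure: If creating the vector search index in MongoDB fails, check that the database and collection are configured correctly, and consult the official MongoDB documentation on creating vector search indexes.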
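
For the connection failure case, a quick sanity check of the connection string can rule out the simplest mistakes before looking at network access. The sketch below uses only the standard library; `check_mongo_uri` is a hypothetical helper, the checks are deliberately shallow, and the `<password>` test assumes the URI was copied from a template containing that placeholder.

```python
from urllib.parse import urlsplit


def check_mongo_uri(uri):
    """Return a list of human-readable problems found in a MongoDB connection string."""
    if not uri:
        return ["URI is empty or not set"]
    problems = []
    parts = urlsplit(uri)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        problems.append(f"unexpected scheme {parts.scheme!r}")
    if not parts.hostname:
        problems.append("no host found")
    # Assumption: the URI came from a template with a <password> placeholder.
    if "<password>" in uri:
        problems.append("placeholder <password> was not replaced")
    return problems
```

An empty result only means the string is well formed. It does not prove that the credentials are valid or that the cluster accepts connections from your IP address.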

If you found this article helpful, please like it and follow my blog. Thanks!
