Movie Recommendation Ranking and Search Algorithm

Latest recommended article published on 2024-05-03 16:04:41

月疯

Category: 【NLP】 Tags: artificial intelligence, deep learning

Article link: https://blog.csdn.net/chehec2010/article/details/137054967

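Project plan:

1. Collect movie features (release year, cast, synopsis, plot, and so on).

2. Vectorize the given movie features with the Keras `one_hot` text encoder that ships with TensorFlow.

3. Use faiss to build an index and run similarity search, using Euclidean (L2) distance.

4. Given a target vector, search the vector set for its nearest neighbors.

5. Display the results.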
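
Step 2 can be illustrated without TensorFlow. The sketch below mimics what Keras `one_hot` followed by `pad_sequences` does: each word is hashed to an integer between 1 and `vocab_size - 1`, then every sequence is zero-padded or cut to a fixed length. The helper names `hash_encode` and `pad_post` are hypothetical, the two sample strings are made up for illustration, and MD5 is only a stand-in hash chosen so the output is repeatable, so the integers will not match what Keras produces. Truncation here keeps the first items of a long sequence, whereas `pad_sequences` by default drops items from the front.

```python
import hashlib
import re

import numpy as np


def hash_encode(text, vocab_size):
    """Map each word of `text` to an integer in [1, vocab_size - 1] (hashing trick)."""
    words = [w for w in re.split(r"[^0-9a-z]+", text.lower()) if w]
    codes = []
    for w in words:
        digest = hashlib.md5(w.encode("utf-8")).hexdigest()
        codes.append(int(digest, 16) % (vocab_size - 1) + 1)
    return codes


def pad_post(sequences, max_length):
    """Zero-pad at the end; sequences longer than max_length keep their first items."""
    out = np.zeros((len(sequences), max_length), dtype=np.int64)
    for row, seq in enumerate(sequences):
        seq = seq[:max_length]
        out[row, :len(seq)] = seq
    return out


# Illustrative "title + genres" strings (assumed, not taken from the dataset file)
titles = [
    "Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995) Adventure|Children|Fantasy",
]
encoded = [hash_encode(t, vocab_size=1000) for t in titles]
vectors = pad_post(encoded, max_length=10)
print(vectors.shape)  # (2, 10)
print(vectors)
```

Because different words can hash to the same integer, this encoding is lossy; a larger `vocab_size` lowers the chance of collisions.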

Environment: Python 3

Install faiss:
pip install faiss-cpu -i https://pypi.tuna.tsinghua.edu.cn/simple/

Dataset:

https://download.csdn.net/download/chehec2010/89036813

Demo code:

import pandas as pd
import numpy as np
import json
import faiss
# Legacy Keras text helpers; they ship with Keras 2.x and may be missing from newer releases
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot

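# Data preprocessing and vectorization

def dataProcess():
    df = pd.read_csv('../movies.csv')
    # Merge the two text columns into one, separated by a space
    df["ht"] = df["title"] + " " + df["genres"]
    # Save the movies together with the merged "ht" column
    df.to_csv('../Httmovie.csv', index=False)
    htt = df["ht"].tolist()
    print(htt[:4])

    vocab_size = 1000  # size of the hashing space used by one_hot
    # one_hot hashes every word to an integer between 1 and vocab_size - 1
    encode = [one_hot(d, vocab_size) for d in htt]

    # Pad every sequence to the same length; here the length is 10
    max_length = 10
    # padding='post' appends zeros at the end; longer sequences are truncated to max_length
    padding_docs = pad_sequences(encode, maxlen=max_length, padding='post')
    print(padding_docs.shape)
    print(padding_docs[:10])
    # Serialize every vector as a JSON string so it can be stored in a CSV
    bet = [json.dumps(lu.tolist()) for lu in padding_docs]
    # Put the vectors into a DataFrame
    test = pd.DataFrame(bet, columns=['features'])
    # Use the original movieId as the id, so search results map back to the right movie
    test['id'] = df['movieId'].values
    # Reorder the columns to 'id', 'features'
    test = test[['id', 'features']]
    # Write to CSV without the extra pandas index column
    test.to_csv('../movie.csv', sep=',', encoding='utf-8', index=False)

# Run once to generate ../movie.csv and ../Httmovie.csv
# dataProcess()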
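# Load the movie vectors
df = pd.read_csv('../movie.csv')
# Inspect the first rows
print(df.head())

# Build the ids
ids = df['id'].values.astype(np.int64)

# Convert the feature strings back to numbers
datas = []

for x in df['features']:
    # Parse the JSON string into a list
    datas.append(json.loads(x))

# Convert the lists to a float32 array, the dtype faiss works with
datas = np.array(datas).astype(np.float32)
print(datas.shape,type(datas))

print(datas[0])  # print the first vector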

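# Build the index
index = faiss.IndexFlatL2(datas.shape[1])  # exact index based on Euclidean (L2) distance
index2 = faiss.IndexIDMap(index)  # wrap the first index so vectors can carry custom ids

# Add the vectors, using ids as their identifiers
index2.add_with_ids(datas,ids)
# Number of vectors in the index
print(index2.ntotal)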

# Load the query vector (find the 5 movies most similar to the movie with id 1)
item_embedding = np.array(json.loads(df[df['id']==1]['features'].iloc[0]))

# Add a batch dimension: [] becomes [[]]
item_embedding = np.expand_dims(item_embedding,axis=0).astype(np.float32)
print(item_embedding)

print(item_embedding.dtype)

# Number of nearest neighbors to retrieve
topn = 5
# D holds the squared L2 distances from the query to each neighbor; I holds the neighbors' ids
D,I = index2.search(item_embedding,topn)

# Inspect the returned ids
print(I)
print(I.shape)

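# Build the recommendation list
target_ids = pd.Series(I[0],name="movieId")
# Load the original movie table
df_movies = pd.read_csv('../Httmovie.csv')
# Drop title and genres (their content is already merged into "ht")
df_movies = df_movies.drop(columns=["title", "genres"])

# Join the neighbor ids with the movie table on movieId
df_result = pd.merge(target_ids.to_frame(), df_movies, on="movieId")

print(df_result)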

The five nearest movies are retrieved correctly.

Explanation of part of the code:

import pandas as pd
import numpy as np
import json

df = pd.read_csv('../movie.csv')
test=df[df['id']==1]['features']
print(test)
#0    [137, 92, 518, 382, 607, 883, 89, 650, 0, 0]
test = test.iloc[0]  # the element in row 0
print(test)
#     [137, 92, 518, 382, 607, 883, 89, 650, 0, 0]
item_embedding = np.expand_dims(json.loads(test),axis=0).astype(np.float32)
# json.loads(test) converts the string '[137, 92, 518, 382, 607, 883, 89, 650, 0, 0]' into the list [137, 92, 518, 382, 607, 883, 89, 650, 0, 0]
print(item_embedding)   #[[137.  92. 518. 382. 607. 883.  89. 650.   0.   0.]]
print(item_embedding.shape) #(1, 10)

'''
When you run the program, watch the position and number of the square brackets [ ].

np.expand_dims(a, axis=0) inserts a new axis at position 0, adding one level of brackets [ ];

np.expand_dims(a, axis=1) inserts a new axis at position 1, adding one level of brackets [ ];

np.expand_dims(a, axis=2) inserts a new axis at position 2 (a must have at least 2 dimensions), adding one level of brackets [ ];

np.expand_dims(a, axis=-1) inserts a new axis at the last position, adding one level of brackets [ ];

'''
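
The effect of `axis` is easiest to see from the resulting shapes. A minimal sketch, using a made-up 3-element vector `a`:

```python
import numpy as np

a = np.array([137.0, 92.0, 518.0], dtype=np.float32)  # shape (3,)

row = np.expand_dims(a, axis=0)    # new axis in front: one query with 3 features
col = np.expand_dims(a, axis=1)    # new axis after the existing one
last = np.expand_dims(a, axis=-1)  # same result as axis=1 for a 1-D input

print(a.shape, row.shape, col.shape, last.shape)  # (3,) (1, 3) (3, 1) (3, 1)
```

faiss searches take a 2-D array of shape (number of queries, dimension), which is why the demo uses `axis=0` to turn a single vector into a batch of one query.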

Testing faiss:

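import numpy as np
import faiss

np.random.seed(42)

num_vectors = 1000
dim = 64

# Create 1000 vectors of dimension 64 (faiss works with float32)
vectors = np.random.rand(num_vectors, dim).astype(np.float32)
# Create an L2-based index
index = faiss.IndexFlatL2(dim)
# Add the vectors to the index
index.add(vectors)

# Create 5 query vectors
queries = np.random.rand(5, dim).astype(np.float32)
# Number of nearest neighbors to retrieve
k = 3
# D is the array of (squared L2) distances, I is the array of indices
D, I = index.search(queries, k)

print("Query results:")
print("Distances:\n", D)
print("Indices:\n", I)

'''
By default, each vector's index is assigned in the order the vectors were added. Sometimes we want to assign our own IDs, and that is what IndexIDMap is for.
Specifically, IndexIDMap is a wrapper that works on top of any FAISS index. First create a base index (for example, IndexFlatL2 for L2 distance),
then wrap it with IndexIDMap. The base index computes the similarity between vectors, while IndexIDMap handles the IDs associated with them.
We then call add_with_ids to assign the IDs ourselves when adding vectors.
'''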
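
To make that division of labor concrete, here is a plain-NumPy sketch of what the two layers compute; it is not how faiss is implemented internally. `flat_l2_search` is a hypothetical helper that returns squared L2 distances and row positions, like a flat L2 index, and the final lookup `ids[positions]` plays the role of the ID map. The vectors and ids are made up.

```python
import numpy as np


def flat_l2_search(database, queries, k):
    """Brute-force search: squared L2 distances and positions of the k nearest rows."""
    diff = queries[:, None, :] - database[None, :, :]
    dist = (diff ** 2).sum(axis=2)  # shape (n_queries, n_database)
    positions = np.argsort(dist, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(dist, positions, axis=1)
    return distances, positions


database = np.array([[0, 0], [1, 0], [0, 3], [5, 5]], dtype=np.float32)
ids = np.array([101, 205, 307, 999], dtype=np.int64)  # custom ids, one per row

query = np.array([[0.9, 0.1]], dtype=np.float32)
D, positions = flat_l2_search(database, query, k=2)
I = ids[positions]  # translate row positions into the custom ids

print(D)  # approximately [[0.02 0.82]]
print(I)  # [[205 101]]
```

Without the ID lookup the search would report positions 1 and 0; with it, the caller gets back the identifiers it supplied, which is what makes joining the results against another table straightforward.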

Reference: https://blog.csdn.net/raelum/article/details/135047797

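A bug worth recording:

index = faiss.IndexFlatL2(dim)
AttributeError: partially initialized module 'faiss' has no attribute 'IndexFlatL2' (most likely due to a circular import)

The cause was a file named faiss.py that I created while testing. It shadows the installed faiss package, so `import faiss` loads the local script instead. Deleting (or renaming) that file fixes the error.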
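
This kind of error can appear whenever a script in the working directory has the same name as an installed package, because by default Python searches the script's own directory first. A small check along these lines can catch it early; `find_shadowing_paths` is a hypothetical helper, not part of faiss or the standard library.

```python
from pathlib import Path


def find_shadowing_paths(module_name, directory="."):
    """Return local files that would shadow `import module_name`."""
    base = Path(directory)
    candidates = [
        base / f"{module_name}.py",           # a script named like the package
        base / module_name / "__init__.py",   # a local package with the same name
    ]
    return [p for p in candidates if p.exists()]


hits = find_shadowing_paths("faiss")
if hits:
    print("Rename or delete:", [str(p) for p in hits])
else:
    print("No local file shadows faiss")
```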