I've recently been tinkering with movie recommendation algorithms; after reading a few articles, I followed along and built a small practice project.
The algorithm is collaborative filtering.
Its strengths:
- It can produce unexpected recommendations and often surfaces pleasant surprises
- It can effectively recommend long-tail items
- It relies only on user behavior and needs no deep understanding of the content, so it applies broadly
Its weaknesses:
- It needs a large amount of <user, item> behavior data up front, i.e. lots of cold-start data
- It is hard to give a sensible explanation for its recommendations
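To make the item-based idea concrete, here is a minimal toy sketch (my own example, not from the article): items correspond to the columns of the user-item rating matrix, so two items rated similarly by the same users come out as similar.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item rating matrix: rows = users, columns = items.
# Users 1 and 2 both like items A and B, so A and B should be similar.
ratings = np.array([
    [5, 4, 0],   # user 1
    [4, 5, 0],   # user 2
    [0, 1, 5],   # user 3
])

# Item-based CF compares COLUMNS, hence the transpose.
sim = cosine_similarity(ratings.T)
print(np.round(sim, 3))  # sim[0][1] is high, sim[0][2] is low
```

The same pattern scales up directly to the real MovieLens matrix built later in the post.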
The code below comes from the article 《碟中谍这么火,我用机器学习做个迷你推荐系统电影》 ("Mission: Impossible is so hot, I built a mini movie recommender with machine learning").
The article's author uses the ml-100k.zip dataset (size: 5 MB, checksum) from the MovieLens website.
However, in my case u.item had an encoding problem; converting it to UTF-8 fixed it.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

user_file = 'G:\\DeepLearn\\训练集\\ml-100k\\u.data'
movie_file = 'G:\\DeepLearn\\训练集\\ml-100k\\u.item'

# u.data: tab-separated user/item/rating/timestamp records
user_df = pd.read_csv(user_file, sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
# u.item: pipe-separated movie metadata; only the first three columns are kept
movie_df = pd.read_csv(movie_file, encoding='utf-8', sep='|', usecols=['movie_id', 'movie_title', 'release_date'],
                       names=['movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb URL', 'unknown',
                              'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary',
                              'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
                              'Thriller', 'War', 'Western'])

n_users = user_df.user_id.unique().shape[0]
n_items = user_df.item_id.unique().shape[0]

# Build the user-item rating matrix (ids are 1-based, hence the -1)
data_matrix = np.zeros((n_users, n_items))
for line in user_df.itertuples():
    data_matrix[line[1] - 1, line[2] - 1] = line[3]
print(data_matrix)

# Item-item cosine similarity: compare columns, so transpose first
item_similarity = cosine_similarity(data_matrix.T, dense_output=True)
print(item_similarity)
def movie_rescsys(keywords, k):
    # Normalize the keyword to title case to match the dataset's titles
    keywords = keywords.title()
    movie_list = []
    try:
        # Take the first title that contains the keyword
        movieid = list(movie_df[movie_df['movie_title'].str.contains(keywords)].movie_id)[0]
        movie_similarity = item_similarity[movieid - 1]
        # Indices sorted by descending similarity; slot 0 is the movie itself
        movie_similarity_index = np.argsort(-movie_similarity)[:k + 1]
        for i in movie_similarity_index:
            # Each entry: [title, similarity, number of ratings, mean rating]
            rec_movies = list(movie_df[movie_df.movie_id == (i + 1)].movie_title)
            rec_movies.append(movie_similarity[i])
            rec_movies.append(len(user_df[user_df.item_id == (i + 1)]))
            rec_movies.append(user_df[user_df.item_id == (i + 1)]['rating'].mean())
            movie_list.append(rec_movies)
    except IndexError:
        print('Movie not found! Please check your input.')
    return movie_list
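The top-k lookup inside movie_rescsys relies on np.argsort over the negated similarity row; a tiny standalone sketch of that trick (toy values of my own, not the real similarity matrix):

```python
import numpy as np

# One row of an item-similarity matrix; slot 0 is the seed movie itself
# (self-similarity 1.0). Negating and argsorting yields indices from
# most to least similar.
movie_similarity = np.array([1.0, 0.2, 0.9, 0.5])
k = 2
top = np.argsort(-movie_similarity)[:k + 1]
print(top)  # the seed movie first, then its k nearest neighbours
```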
def main():
    name = input('Enter a movie title or keyword: ')
    num = int(input('How many related movies would you like (enter an integer): '))
    result = movie_rescsys(name, num)
    print(result)
    # The two lines below are not used further in this version
    movie_id = list(movie_df[movie_df['movie_title'] == result[0][0]].movie_id)[0]
    choice = input('Set a similarity threshold? (y/n) ')

main()
I have to say, this algorithm works quite well: the movies it retrieves are highly relevant. So I got itchy to train on a larger dataset and see how it performs, and picked the slightly bigger ml-latest-small.zip (size: 1 MB).
However, plugging this dataset straight into the original code does not work. In this dataset the movie ids are not consecutive: there are only 9,742 movies, yet the ids go all the way up to 193,609. So the original code hits an index-out-of-bounds error at this line:
for line in user_df.itertuples():
    data_matrix[line[1] - 1, line[2] - 1] = line[3]
My fix was to transform the dataset directly: renumber the movie table sequentially, and replace the movie ids in the ratings table with the new ids.
user_file = 'G:\\DeepLearn\\训练集\\ml-latest-small\\ratings.csv'
movie_file = 'G:\\DeepLearn\\训练集\\ml-latest-small\\movies.csv'
user_df = pd.read_csv(user_file, sep=',', names=['user_id', 'item_id', 'rating', 'timestamp'])
movie_df = pd.read_csv(movie_file, encoding='utf-8', sep=',', usecols=['movie_id', 'movie_title', 'Genres'],
                       names=['movie_id', 'movie_title', 'Genres'])

# Renumber the movies sequentially, remembering old id -> new id
i = 1
file_txt = ''
movie_id_dic = {}
for line in movie_df.itertuples():
    movie_id_dic[line[1]] = i
    file_txt += '{},{},{}\r\n'.format(i, line[2], line[3])
    i += 1
f = open('G:\\DeepLearn\\训练集\\ml-latest-small\\movies_new.csv', 'w+', encoding='utf-8', newline='')
f.seek(0)
f.truncate()  # empty the file
f.write(file_txt)
f.close()

# Rewrite the ratings table with the new movie ids
file_txt = ''
for line in user_df.itertuples():
    file_txt += '{},{},{},{}\r\n'.format(line[1], movie_id_dic[line[2]], line[3], line[4])
f = open('G:\\DeepLearn\\训练集\\ml-latest-small\\ratings_new.csv', 'w+', encoding='utf-8', newline='')
f.seek(0)
f.truncate()  # empty the file
f.write(file_txt)
f.close()
user_file = 'G:\\DeepLearn\\训练集\\ml-latest-small\\ratings_new.csv'
movie_file = 'G:\\DeepLearn\\训练集\\ml-latest-small\\movies_new.csv'
user_df = pd.read_csv(user_file, sep=',', names=['user_id', 'item_id', 'rating', 'timestamp'])
movie_df = pd.read_csv(movie_file, encoding='utf-8', sep=',', usecols=['movie_id', 'movie_title', 'Genres'],
                       names=['movie_id', 'movie_title', 'Genres'])

n_users = user_df.user_id.unique().shape[0]
print(n_users)
# Use the remapping dict for the item count, so every movie gets a
# column even if no user has rated it
n_items = len(movie_id_dic)
print(n_items)

data_matrix = np.zeros((n_users, n_items))
for line in user_df.itertuples():
    data_matrix[line[1] - 1, line[2] - 1] = line[3]
item_similarity = cosine_similarity(data_matrix.T, dense_output=True)
print(item_similarity)
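For reference, the same remapping can also be done in memory, without rewriting any files, using pandas categorical codes (my own sketch on toy data, not the article's approach):

```python
import pandas as pd
import numpy as np

# Toy ratings with sparse, non-contiguous movie ids (hypothetical values)
ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'item_id': [1, 193609, 193609],
    'rating':  [4.0, 5.0, 3.0],
})

# Categorical codes map each distinct id to a dense index 0..n-1,
# which can be used directly as matrix coordinates.
user_codes = ratings.user_id.astype('category').cat.codes
item_codes = ratings.item_id.astype('category').cat.codes

matrix = np.zeros((user_codes.max() + 1, item_codes.max() + 1))
matrix[user_codes, item_codes] = ratings.rating
print(matrix)  # 2 users x 2 items, no out-of-bounds access
```

This avoids the intermediate movies_new.csv / ratings_new.csv files entirely, at the cost of having to keep the code-to-original-id mapping around for display.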
That's all it takes; the rest of the code stays unchanged. A bit crude, admittedly. One more note: before using this new CSV dataset, I removed the header rows to keep things simple.
Here is the result (each entry is [title, similarity, number of ratings, mean rating]):
Enter a movie title or keyword: dead
How many related movies would you like (enter an integer): 10
[
['Dracula: Dead and Loving It (1995)', 1.0, 19, 2.4210526315789473],
['Fog', 0.567672, 5, 2.1], ['Basic (2003)', 0.53364, 2, 3.75],
['Wrong Turn (2003)', 0.487194, 2, 3.0],
['Batman Beyond: Return of the Joker (2000)', 0.464686, 3, 3.5],
['Batman: Under the Red Hood (2010)', 0.462137, 3, 3.6666666666666665],
['Friday the 13th (2009)', 0.449013, 2, 3.0],
['Whistleblower', 0.448733, 2, 3.25],
['Ready to Rumble (2000)', 0.445439, 6, 2.25],
['RocketMan (a.k.a. Rocket Man) (1997)', 0.443472, 7, 2.142857142857143],
['All Dogs Go to Heaven 2 (1996)', 0.441771, 11, 3.1818181818181817]
]
This system has one flaw: the keyword search uses plain string matching and takes the first matching movie as the seed for recommendations. So although the recommendations themselves are decent, the keyword-based movie search still needs work.
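One possible improvement (my own sketch, not the original author's code): return every title that contains the keyword, matched case-insensitively, and let the user pick the intended movie before computing recommendations.

```python
import pandas as pd

# Hypothetical slice of the movie table, for illustration only
movie_df = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'movie_title': ['Dead Man Walking (1995)',
                    'Dracula: Dead and Loving It (1995)',
                    'Toy Story (1995)'],
})

def find_candidates(keyword, movie_df):
    # Case-insensitive literal substring match over all titles,
    # instead of taking only the first hit
    mask = movie_df['movie_title'].str.contains(keyword, case=False, regex=False)
    return movie_df.loc[mask, ['movie_id', 'movie_title']]

print(find_candidates('dead', movie_df))  # both "Dead" titles, not just the first
```

The seed movie would then be chosen from this candidate list (e.g. by a numbered prompt) before item_similarity is consulted.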