Building a Movie-Rating Recommender System with a Weighted Matrix Factorization Model

Latest recommended article published 2022-04-28 11:05:00

weixin_39997310


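This article shows how to implement weighted matrix factorization with WALSModel from the TensorFlow library and how to build a recommender system on movie-rating data that suggests movies tailored to each user. The steps cover data preprocessing, matrix construction, model training and evaluation, and applying the model to produce recommendations.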


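This article is taken from Example 36 of the book 《深度学习之TensorFlow工程化项目实战》 (Deep Learning with TensorFlow: Engineering Projects in Practice), written by Li Jinhong (李金洪) and published by Publishing House of Electronics Industry (电子工业出版社).

It implements a weighted alternating least squares (WALS) matrix factorization model by calling the tensorflow.contrib.factorization.WALSModel interface in TensorFlow, and uses that model to build a recommender system based on movie ratings.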

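Example description

We have a movie-rating dataset with four fields: user, movie, rating, and timestamp.

The task is to design a model that learns the patterns in this data and recommends other movies each user is likely to enjoy.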

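I. Downloading and loading the dataset

Download the movie-rating dataset from the following link:

http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

After downloading, unzip it into the same directory as your code, then follow the steps below.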

1. The dataset

The movie-rating dataset contains the following files:

links.csv

movies.csv

ratings.csv

README.txt

tags.csv

Only the ratings file, ratings.csv, matters here. Its content looks like this:

userId,movieId,rating,timestamp

1,31,2.5,1260759144

1,1029,3.0,1260759179

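2. Code: reading the dataset and sorting it by time

Load the data into memory and sort it by timestamp. The code is as follows:

Listing 1: Movie recommender system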

import collections
import csv
import os

DATASET_PATH = 'ml-latest-small'
RATINGS_CSV = os.path.join(DATASET_PATH, 'ratings.csv')  # path to the ratings file

Rating = collections.namedtuple('Rating', ['user_id', 'item_id', 'rating', 'timestamp'])
ratings = list()
with open(RATINGS_CSV, newline='') as f:  # load the data
    reader = csv.reader(f)
    next(reader)  # skip the header row with the field names
    for user_id, item_id, rating, timestamp in reader:
        ratings.append(Rating(user_id, item_id, float(rating), int(timestamp)))

ratings = sorted(ratings, key=lambda r: r.timestamp)  # sort by timestamp
print('Ratings: {:,}'.format(len(ratings)))

Running the code prints the following:

Ratings: 100,004

The "100,004" in the output means the dataset contains 100,004 rating records in total.

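II. Code: building a sparse matrix from the user and movie columns

The steps in this section are:

(1) Extract the user IDs and the movie IDs separately.

(2) Build dictionaries that map between the extracted IDs and matrix indices.

(3) Lay out a grid matrix with users along one dimension and movies along the other.

(4) Store that grid matrix as a sparse matrix.

The code is as follows:

Listing 1: Movie recommender system (continued)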

import tensorflow as tf
import numpy as np

# Collect the user IDs, then build index -> ID and ID -> index dictionaries
users_from_idx = sorted(set(r.user_id for r in ratings), key=int)
users_from_idx = dict(enumerate(users_from_idx))
users_to_idx = dict((user_id, idx) for idx, user_id in users_from_idx.items())
print('User Index:', [users_from_idx[i] for i in range(2)])

# Collect the movie IDs, then build index -> ID and ID -> index dictionaries
items_from_idx = sorted(set(r.item_id for r in ratings), key=int)
items_from_idx = dict(enumerate(items_from_idx))
items_to_idx = dict((item_id, idx) for idx, item_id in items_from_idx.items())
print('Item Index:', [items_from_idx[i] for i in range(2)])

sess = tf.InteractiveSession()
# Cross users with movies and fill in the ratings
indices = [(users_to_idx[r.user_id], items_to_idx[r.item_id]) for r in ratings]
values = [r.rating for r in ratings]
n_rows = len(users_from_idx)
n_cols = len(items_from_idx)
shape = (n_rows, n_cols)

P = tf.SparseTensor(indices, values, shape)  # build the sparse matrix

print(P)
print('Total values: {:,}'.format(n_rows * n_cols))

Running the code prints the following:

User Index: ['1', '2']
Item Index: ['1', '2']
SparseTensor(indices=Tensor("SparseTensor_11/indices:0", shape=(100004, 2), dtype=int64), values=Tensor("SparseTensor_11/values:0", shape=(100004,), dtype=float32), dense_shape=Tensor("SparseTensor_11/dense_shape:0", shape=(2,), dtype=int64))
Total values: 6,083,286

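The output shows the following:

The first two lines show the first two user IDs and the first two movie IDs.

The third line shows the generated sparse matrix.

The last line shows that the user-by-movie matrix has 6,083,286 cells.

Tip:

The resulting matrix is very large. The best way to handle a matrix of this size is to store it as a sparse matrix; holding it in memory as a dense matrix would waste considerable resources.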
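To make the tip concrete, the following back-of-the-envelope sketch uses the figures reported in this article's outputs (671 users, 9,066 movies, 100,004 ratings). The byte sizes are an assumption: float32 values and int64 index pairs, matching the dtypes in the SparseTensor printout, and ignoring any container overhead.

```python
# Figures reported in this article's outputs.
n_ratings = 100_004            # observed ratings
n_users, n_items = 671, 9066   # matrix dimensions

total_cells = n_users * n_items
density = n_ratings / total_cells  # fraction of cells that hold a rating

# Assumption: 4 bytes per float32 value, 8 bytes per int64 index.
dense_bytes = total_cells * 4
sparse_bytes = n_ratings * (4 + 2 * 8)  # one value plus a (row, col) pair

print('Total cells: {:,}'.format(total_cells))
print('Density: {:.2%}'.format(density))
print('Dense storage:  {:.1f} MB'.format(dense_bytes / 1e6))
print('Sparse storage: {:.1f} MB'.format(sparse_bytes / 1e6))
```

Fewer than 2% of the cells hold a rating, so the sparse form stores only a small fraction of what the dense form would.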

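III. Code: building and training the WALS model

Call the tensorflow.contrib.factorization.WALSModel interface to build the WALS model. The WALSModel interface supports distributed training and regularization; use the help command to see its parameters in detail. Note that the tensorflow.contrib package exists only in TensorFlow 1.x, so this code needs a 1.x installation.

The code is as follows:

Listing 1: Movie recommender system (continued)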

from tensorflow.contrib.factorization import WALSModel

k = 10      # number of latent dimensions after factorization
n = 10      # number of training iterations
reg = 1e-1  # regularization weight

model = WALSModel(        # create the WALSModel
    n_rows,               # number of rows (users)
    n_cols,               # number of columns (movies)
    k,                    # dimension of the factor matrices
    regularization=reg,   # regularization weight used during training
    unobserved_weight=0)  # ignore entries that have no rating

row_factors = tf.nn.embedding_lookup(  # fetch the row (user) factor matrix from the model
    model.row_factors,
    tf.range(model._input_rows),
    partition_strategy="div")
col_factors = tf.nn.embedding_lookup(  # fetch the column (movie) factor matrix from the model
    model.col_factors,
    tf.range(model._input_cols),
    partition_strategy="div")

# Get the row and column indices of the original entries in the sparse matrix
row_indices, col_indices = tf.split(P.indices,
                                    axis=1,
                                    num_or_size_splits=2)
# Look up the factor vectors for those indices
gathered_row_factors = tf.gather(row_factors, row_indices)
gathered_col_factors = tf.gather(col_factors, col_indices)
# Multiply row and column factors to get the predicted ratings
approx_vals = tf.squeeze(tf.matmul(gathered_row_factors,
                                   gathered_col_factors,
                                   adjoint_b=True))
P_approx = tf.SparseTensor(indices=P.indices,  # assemble the predictions into a sparse matrix
                           values=approx_vals,
                           dense_shape=P.dense_shape)

E = tf.sparse_add(P, P_approx * (-1))  # subtract the two sparse matrices
E2 = tf.square(E)
n_P = P.values.shape[0].value
rmse_op = tf.sqrt(tf.sparse_reduce_sum(E2) / n_P)  # RMSE over the observed ratings

# Define the ops that update the factor matrices
row_update_op = model.update_row_factors(sp_input=P)[1]
col_update_op = model.update_col_factors(sp_input=P)[1]

model.initialize_op.run()
model.worker_init.run()
for _ in range(n):  # train for the given number of iterations
    model.row_update_prep_gramian_op.run()  # update the row (user) factors
    model.initialize_row_update_op.run()
    row_update_op.run()

    model.col_update_prep_gramian_op.run()  # update the column (movie) factors
    model.initialize_col_update_op.run()
    col_update_op.run()

    print('RMSE: {:,.3f}'.format(rmse_op.eval()))  # print the RMSE

user_factors = model.row_factors[0].eval()
item_factors = model.col_factors[0].eval()

print('User factors shape:', user_factors.shape)  # shapes of the factor matrices
print('Item factors shape:', item_factors.shape)

Running the code prints the following:

RMSE: 1.999
RMSE: 0.791
……
RMSE: 0.538
User factors shape: (671, 10)
Item factors shape: (9066, 10)

The last two lines of the output give the sizes of the factor matrices: the user matrix has shape (671, 10) and the movie matrix has shape (9066, 10).
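To illustrate what the alternating row and column updates compute, here is a minimal NumPy sketch of alternating least squares on a toy matrix. It is not the WALSModel implementation: it assumes that unobserved entries are ignored (as with unobserved_weight=0) and that the regularizer is a plain L2 penalty added to each per-row least-squares problem. The toy data and the helper names als_step and objective are hypothetical.

```python
import numpy as np

def als_step(R, mask, fixed, reg):
    """Solve one regularized least-squares problem per row of R, using observed entries only."""
    k = fixed.shape[1]
    updated = np.zeros((R.shape[0], k))
    for i in range(R.shape[0]):
        obs = mask[i]                   # which columns of row i are observed
        F = fixed[obs]                  # factor vectors of the observed counterparts
        A = F.T @ F + reg * np.eye(k)   # regularized normal equations
        b = F.T @ R[i, obs]
        updated[i] = np.linalg.solve(A, b)
    return updated

def objective(R, mask, U, V, reg):
    """Squared error on observed entries plus the L2 penalty on both factor matrices."""
    err = (R - U @ V.T)[mask]
    return np.sum(err ** 2) + reg * (np.sum(U ** 2) + np.sum(V ** 2))

# Hypothetical 4-user x 5-movie rating matrix; 0 marks "not rated".
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 2],
              [1, 1, 0, 5, 0],
              [0, 1, 5, 4, 0]], dtype=float)
mask = R > 0

rng = np.random.default_rng(0)
k, reg = 2, 0.1
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

history = []
for _ in range(10):
    U = als_step(R, mask, V, reg)       # update user factors with movie factors fixed
    V = als_step(R.T, mask.T, U, reg)   # update movie factors with user factors fixed
    history.append(objective(R, mask, U, V, reg))

rmse = np.sqrt(np.mean(((R - U @ V.T)[mask]) ** 2))
print('Final RMSE on observed entries: {:.3f}'.format(rmse))
```

Because each half-step solves its least-squares problem exactly, the regularized objective cannot increase from one iteration to the next, which is why the RMSE printed during training tends to fall quickly in the first few iterations.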

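IV. Code: evaluating the WALS model

The model is evaluated as follows:

(1) Find the user with the most ratings in the dataset.

(2) Take that user's most recent 5-star rating record.

(3) Look up the factor vectors for the user and for the movie in that record.

(4) Compare the rating predicted from the factor vectors with the actual rating from step (2) to see how close the WALS model comes.

The code is as follows:

Listing 1: Movie recommender system (continued)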

c = collections.Counter(r.user_id for r in ratings)
user_id, n_ratings = c.most_common(1)[0]
# Find the user with the most ratings
print('Most active user {}: {:,d}'.format(user_id, n_ratings))

# Take that user's most recent 5-star rating
r = next(r for r in reversed(ratings) if r.user_id == user_id and r.rating == 5.0)
print("That user's most recent 5-star rating: ", r)

# Look up the corresponding entries in the factor matrices
i = users_to_idx[r.user_id]
j = items_to_idx[r.item_id]

u = user_factors[i]  # factor vector for the user
print('Factors for user {}:\n'.format(r.user_id))
print(u)

v = item_factors[j]  # factor vector for the movie
print('Factors for item {}:\n'.format(r.item_id))
print(v)

p = np.dot(u, v)  # predicted rating
# Print the prediction, its difference from the actual rating, and their ratio
print('Approx. rating: {:,.3f}, diff={:,.3f}, {:,.3%}'.format(p, r.rating - p, p / r.rating))

Running the code prints the following:

Most active user 547: 2,391
That user's most recent 5-star rating:  Rating(user_id='547', item_id='163949', rating=5.0, timestamp=1476419239)
Factors for user 547:
 [-0.11183977 -0.09171382 -0.10098672 -0.7796077   0.33030528 -0.03237698
  0.48777038  0.4614259  -0.6705016  -0.4126554 ]
Factors for item 163949:
 [-0.29128832 -0.23886949 -0.263021   -2.0304952   0.8602844  -0.0843261
  1.270403    1.2017884  -1.7463298  -1.0747647 ]
Approx. rating: 4.740, diff=0.260, 94.791%

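The output shows that, for this record, the predicted rating of 4.740 is 94.791% of the actual rating of 5.0. Because this record was part of the training data, the figure is a sanity check on the fit rather than a measure of accuracy on unseen ratings.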
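The check above uses a single rating that the model was trained on. As a hedged sketch of a stricter evaluation, which the book example does not perform, one could hold out the most recent ratings, train only on the rest, and compute RMSE on the held-out part. The helpers chronological_split and holdout_rmse, the toy ratings, and the toy factor matrices below are all hypothetical.

```python
import collections
import numpy as np

Rating = collections.namedtuple('Rating', ['user_id', 'item_id', 'rating', 'timestamp'])

def chronological_split(ratings, n_test):
    """Return (train, test) where test holds the n_test most recent ratings."""
    ordered = sorted(ratings, key=lambda r: r.timestamp)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

def holdout_rmse(test, user_factors, item_factors, users_to_idx, items_to_idx):
    """RMSE over test ratings whose user and movie both have factor vectors."""
    errors = []
    for r in test:
        if r.user_id in users_to_idx and r.item_id in items_to_idx:
            u = user_factors[users_to_idx[r.user_id]]
            v = item_factors[items_to_idx[r.item_id]]
            errors.append(r.rating - np.dot(u, v))
    if not errors:
        return float('nan')
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical toy data; in practice the factors would come from training on `train` only.
toy = [Rating('1', '10', 4.0, 100), Rating('2', '10', 3.0, 200),
       Rating('1', '20', 5.0, 300), Rating('2', '20', 2.0, 400),
       Rating('1', '30', 1.0, 500)]
train, test = chronological_split(toy, n_test=2)

users_to_idx = {'1': 0, '2': 1}
items_to_idx = {'10': 0, '20': 1}   # movie '30' never appears in train, so it is skipped
user_factors = np.array([[1.0, 2.0], [1.0, 0.0]])
item_factors = np.array([[2.0, 1.0], [1.0, 2.0]])

score = holdout_rmse(test, user_factors, item_factors, users_to_idx, items_to_idx)
print('Held-out RMSE: {:.3f}'.format(score))
```

Skipping test ratings for users or movies that never appear in the training part is one simple design choice; a factorization model has no factor vector for them, so it cannot score them.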

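V. Code: recommending movies to a user with the WALS model

Recommending movies with the WALS model takes the following steps:

(1) Use the WALS model to compute this user's predicted rating for every movie.

(2) From all the predicted ratings, keep those for movies the user has not rated in the real dataset.

(3) Sort them by predicted rating.

(4) Take the 10 movies with the highest predicted ratings and recommend them to the user.

The code is as follows:

Listing 1: Movie recommender system (continued)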

# Rank the recommendations
V = item_factors
user_P = np.dot(V, u)
print('Shape of the predicted ratings for this user:', user_P.shape)
# Movies this user has already rated
user_items = set(ur.item_id for ur in ratings if ur.user_id == user_id)

user_ranking_idx = sorted(enumerate(user_P), key=lambda p: p[1], reverse=True)
user_ranking_raw = ((items_from_idx[j], p) for j, p in user_ranking_idx)
# Keep only the predicted ratings of movies this user has not rated
user_ranking = [(item_id, p) for item_id, p in user_ranking_raw if item_id not in user_items]

top10 = user_ranking[:10]  # take the top 10

print('Top 10 items:\n')
for rank, (item_id, p) in enumerate(top10, start=1):  # ranked list of movies the user may like
    print('[{}] {} {:,.2f}'.format(rank, item_id, p))

Running the code prints the following:

Shape of the predicted ratings for this user: (9066,)
Top 10 items:
[1] 1211 6.85
[2] 1273 6.49
……
[9] 2594 5.63
[10] 501 5.53

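The first line of the output shows the shape of this user's predicted-rating vector: one predicted rating for each of the 9,066 movies.

Next, the 10 highest-scoring movies are selected from those the user has not yet rated. Predicted scores can exceed the 5-star maximum because the factorization does not constrain its outputs to the rating scale; only the ordering matters for the ranking.

These are the movies the user is likely to enjoy, and they are pushed to the user as recommendations.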
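The ranking steps above can also be written with NumPy array operations. The following is a minimal sketch on hypothetical toy factors, not the book's code; the helper name top_n_unseen is an assumption, and it works with matrix indices, so the results would still need to be mapped back to movie IDs through items_from_idx.

```python
import numpy as np

def top_n_unseen(user_vec, item_factors, seen_idx, n=10):
    """Return (item_index, score) pairs for the n highest-scoring items not in seen_idx."""
    scores = item_factors @ user_vec
    order = np.argsort(-scores, kind='stable')  # item indices from highest to lowest score
    picked = [int(j) for j in order if int(j) not in seen_idx][:n]
    return [(j, float(scores[j])) for j in picked]

# Hypothetical toy factors: 5 movies, 2 latent dimensions.
toy_item_factors = np.array([[1.0, 0.0],
                             [0.0, 1.0],
                             [1.0, 1.0],
                             [0.5, 0.5],
                             [2.0, 0.0]])
toy_user_vec = np.array([1.0, 2.0])
already_rated = {2}  # matrix indices of movies the user has rated

recommendations = top_n_unseen(toy_user_vec, toy_item_factors, already_rated, n=3)
for rank, (j, score) in enumerate(recommendations, start=1):
    print('[{}] item index {} {:.2f}'.format(rank, j, score))
```

A full sort is used here for clarity; with only a few thousand movies per user its cost is negligible.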
