1. Collaborative filtering is one of the most widely used recommendation techniques in recommender systems. Its essence is to model and analyze a user's historical rating matrix and, from that model, recommend suitable items to the user.
Collaborative filtering has been studied extensively in both academia and industry, and many algorithms have been proposed. Common ones include nearest-neighbor methods (both user-based and item-based), Slope One, latent factor models (chiefly restricted Boltzmann machines and matrix factorization), Bayesian models, clustering techniques, and decision-tree methods. The most widely used and most effective among these are the nearest-neighbor methods and latent factor models. User-based nearest-neighbor collaborative filtering recommends items that similar users liked but the target user has not yet rated ("birds of a feather flock together"). Item-based nearest-neighbor collaborative filtering recommends items similar to those the user liked in the past ("like attracts like"). Matrix factorization approximates the user-item rating matrix R as the product of two low-dimensional factor matrices, a user factor matrix U and an item factor matrix V, i.e. R ≈ UV. The rest of this post focuses on the user-based nearest-neighbor recommendation algorithm, UserCF.
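The R ≈ UV decomposition mentioned above can be sketched with a toy example. The rating matrix, factor dimension, learning rate, and regularization weight below are all made up for illustration; this is a minimal gradient-descent sketch, not the method the rest of this post implements.

```python
import random

# Toy matrix factorization: approximate R (0 = unobserved) as U·V by
# stochastic gradient descent over the observed entries only.
R = [[5, 3, 0, 1],
     [4, 0, 0, 1],
     [1, 1, 0, 5],
     [0, 1, 5, 4]]

n_users, n_items, f = len(R), len(R[0]), 2   # f = number of latent factors
random.seed(0)
U = [[random.random() for _ in range(f)] for _ in range(n_users)]
V = [[random.random() for _ in range(n_items)] for _ in range(f)]

lr, reg = 0.005, 0.02   # learning rate and L2 regularization weight
for step in range(3000):
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i] == 0:
                continue               # skip unobserved ratings
            pred = sum(U[u][k] * V[k][i] for k in range(f))
            e = R[u][i] - pred         # prediction error on this entry
            for k in range(f):
                U[u][k] += lr * (e * V[k][i] - reg * U[u][k])
                V[k][i] += lr * (e * U[u][k] - reg * V[k][i])

# Observed entries are now closely reproduced; the zero entries get
# predicted scores that can be used for recommendation.
print(sum(U[0][k] * V[k][0] for k in range(f)))   # close to R[0][0] = 5
```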
2. The Pearson coefficient is a value between -1 and 1 that describes how strongly two sets of data move together linearly. Mathematically, it is the covariance of the two variables divided by the product of their standard deviations. In the computational form used by the code below, for n co-rated items:

r = (Σxy − Σx·Σy / n) / ( sqrt(Σx² − (Σx)² / n) · sqrt(Σy² − (Σy)² / n) )
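The formula above can be checked in isolation. This is a small standalone sketch using the same computational form as the recommender class later in this post; the sample rating lists are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    denom = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    if denom == 0:
        return 0.0   # one of the lists is constant
    return (sum_xy - sum_x * sum_y / n) / denom

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ 1.0  (perfectly correlated)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1.0 (perfectly anti-correlated)
```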
3. The implementation below is in Python. The data set is MovieLens 100K, with u1.base as the training set and u1.test as the test set; MAE and RMSE are used as evaluation metrics. The code is as follows:
import codecs
import time
from math import sqrt


class recommender:
    def __init__(self, k=1, metric=''):
        self.k = k            # number of nearest neighbors to use
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson

    def loadMovieLens(self, path='F:\\Programmer\'s Guide to Data Mining\\movelenseData\\ml-100k\\'):
        """Load u1.base as the training set and u1.test as the test set."""
        self.data = {}      # training ratings: {user: {movie: rating}}
        self.datatest = {}  # test ratings:     {user: {movie: rating}}
        with codecs.open(path + "u1.base", 'r', 'ascii') as f:
            for line in f:
                user, movie, rating = line.split('\t')[:3]
                self.data.setdefault(user, {})[movie] = int(rating.strip().strip('"'))
        with codecs.open(path + "u1.test", 'r', 'ascii') as f:
            for line in f:
                user, movie, rating = line.split('\t')[:3]
                self.datatest.setdefault(user, {})[movie] = int(rating.strip().strip('"'))

    def pearson(self, rating1, rating2):
        """Pearson correlation over the movies both users have rated."""
        sum_xy = sum_x = sum_y = sum_x2 = sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x, y = rating1[key], rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += x * x
                sum_y2 += y * y
        if n == 0:
            return 0  # no commonly rated movies
        denominator = (sqrt(sum_x2 - sum_x ** 2 / n)
                       * sqrt(sum_y2 - sum_y ** 2 / n))
        if denominator == 0:
            return 0
        return (sum_xy - sum_x * sum_y / n) / denominator

    def computeNearestNeighbor(self, username):
        """Return (user, similarity) pairs sorted by decreasing similarity."""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username], self.data[instance])
                if distance <= 0:
                    continue  # keep only positively correlated neighbors
                distances.append((instance, distance))
        distances.sort(key=lambda pair: pair[1], reverse=True)
        return distances

    def getAverageRating(self, user):
        """Mean training-set rating of a user."""
        return sum(self.data[user].values()) / len(self.data[user])

    def validation(self):
        """Predict every test rating and return (MAE, RMSE)."""
        pred = {}
        for user, userRatings in self.datatest.items():
            pred.setdefault(user, {})
            avg_u_rating = self.getAverageRating(user)
            neighbors = self.computeNearestNeighbor(user)
            z = min(len(neighbors), self.k)
            for m in userRatings:
                # Weighted sum of mean-centered neighbor ratings; if no
                # neighbor rated item m, fall back to the user's mean rating.
                neighRating = 0.0
                neighSimSum = 0.0
                for i in range(z):
                    name = neighbors[i][0]
                    neighborRatings = self.data[name]
                    if m not in neighborRatings:
                        continue
                    nuserAverage = self.getAverageRating(name)
                    neighRating += (neighborRatings[m] - nuserAverage) * neighbors[i][1]
                    neighSimSum += abs(neighbors[i][1])
                if neighSimSum == 0:
                    pred[user][m] = avg_u_rating
                else:
                    pred[user][m] = avg_u_rating + neighRating / neighSimSum
        # Compare the predictions against the held-out test ratings.
        erro_sum = 0.0
        sqrError_sum = 0.0
        setSum = 0
        for user, items in pred.items():
            for m in items:
                erro_sum += abs(pred[user][m] - self.datatest[user][m])
                sqrError_sum += (pred[user][m] - self.datatest[user][m]) ** 2
                setSum += 1
        mae = erro_sum / setSum
        rmse = sqrt(sqrError_sum / setSum)
        return mae, rmse


start = time.perf_counter()
r = recommender(20, 'pearson')
r.loadMovieLens()
mae, rmse = r.validation()
print("MAE:", mae)
print("RMSE:", rmse)
end = time.perf_counter()
print("Total time: %f s" % (end - start))
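The two evaluation metrics computed at the end of `validation` can be illustrated on their own. The predicted/actual rating pairs below are made up; MAE is the mean absolute error and RMSE the root of the mean squared error.

```python
from math import sqrt

# Made-up (predicted, actual) rating pairs for illustration.
pairs = [(3.5, 4), (2.0, 1), (4.2, 5)]

mae = sum(abs(p - a) for p, a in pairs) / len(pairs)
rmse = sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

print(round(mae, 3))    # ≈ 0.767
print(round(rmse, 3))   # ≈ 0.794
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE does.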
4. Run results:
5. Summary
The accuracy is clearly not high. One reason is that this is only the simplest UserCF algorithm, implemented in plain Python, so there is still much room for improvement; another is that the parameters (such as the neighborhood size k) have not been tuned.
Other recommendation algorithms will be covered in future updates.
6. Reference: Programmer's Guide to Data Mining