Recommender System Scoring and Improving It with SVD

This post is based on Chapter 14 of "Machine Learning in Action".


1.      How the recommendation score is computed

 

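       First, a few words about recommender systems, since they are one of the applications of SVD.

       A simple recommender system computes the similarity between items or between people. A more advanced approach first uses SVD to build a topic space from the data and then computes similarity in that space.

       Food recommendation is the example used here. The goal is to find dishes a user has not tried but is likely to enjoy: starting from the existing ratings, use similarity to compute a score for every dish the user has not tried, then recommend the dishes with the highest scores.

       The computation is simple. We start from a user-by-dish rating matrix, 11X11, as shown in the figure below.

       As an example, compute the scores of the dishes that the third user, Drew, has not tried. Drew has ratings for mapo tofu and Indian paneer curry (1 and 4), so the remaining 9 dishes need scores. The algorithm works as follows: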

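(1)      Find all zero and nonzero entries in Drew's row. In the figure below, the green cell is one of the zero entries and the yellow cells are the nonzero entries (1 and 4). Only the green zero entry is computed here; the others are handled the same way.

(2)      Intersect the column containing the green 0 with the column containing the yellow 1, that is, keep only the rows where both columns are nonzero, and compute the similarity a1 on those rows (using a similarity measure such as the correlation coefficient). In the same way, compute the similarity a2 between the column of the green 0 and the column of the yellow 4.

(3)      The score of the green 0 is (a1*1+a2*4)/(a1+a2). Dividing by (a1+a2) keeps the score within the 0 to 5 rating range.

(4)      The other entries are computed the same way.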
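The four steps above can be sketched in a few lines of plain NumPy. This is only an illustrative sketch: the 3x3 matrix `toy` and the names `cos_sim` and `estimate_score` are made up for this example and are not from the book, and cosine similarity is used in place of the correlation coefficient mentioned in step (2).

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity, rescaled from [-1, 1] to [0, 1]."""
    return 0.5 + 0.5 * float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def estimate_score(ratings, user, item, sim=cos_sim):
    """Similarity-weighted average of the ratings the user has already given."""
    sim_total = 0.0
    weighted_total = 0.0
    for j in np.nonzero(ratings[user])[0]:        # step (1): items the user rated
        # step (2): users who rated both the target item and item j
        both = (ratings[:, item] > 0) & (ratings[:, j] > 0)
        s = sim(ratings[both, item], ratings[both, j]) if both.any() else 0.0
        sim_total += s
        weighted_total += s * ratings[user, j]
    # step (3): normalize so the score stays inside the rating range
    return weighted_total / sim_total if sim_total > 0 else 0.0

# Hypothetical toy data: rows are users, columns are dishes, 0 means "not rated".
toy = np.array([[4, 4, 0],
                [2, 2, 5],
                [0, 1, 4]], dtype=float)

# User 2 rated dish 1 with 1 and dish 2 with 4; both similarities come out as 1,
# so the estimate for dish 0 is about (1*1 + 1*4) / (1 + 1) = 2.5.
print(estimate_score(toy, user=2, item=0))
```

With both similarities equal, the estimate is simply the mean of the two existing ratings, which makes the role of the (a1+a2) normalization easy to see.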


The code is below. The data matrix differs from the one in the figure above, but the computation is the same.

from numpy import *
from numpy import linalg as la

    
def loadExData2():
    return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

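def cosSim(inA,inB):
    # inA and inB are column vectors (numpy matrices) of equal length
    num = float((inA.T*inB)[0,0])  # dot product, taken out of the 1x1 matrix
    denom = la.norm(inA)*la.norm(inB)
    return 0.5+0.5*(num/denom)  # rescale the cosine from [-1, 1] to [0, 1]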

def standEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0: continue
        overLap = nonzero(logical_and(dataMat[:,item].A>0, \
                                      dataMat[:,j].A>0))[0]
        if len(overLap) == 0: similarity = 0
        else: similarity = simMeas(dataMat[overLap,item], \
                                   dataMat[overLap,j])
        print ('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    unratedItems = nonzero(dataMat[user,:].A==0)[1]#find unrated items 
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]


Compute the scores for all zero entries in the third row:

my_dat = mat(loadExData2())
recommend(my_dat,2,N=10)


the 0 and 4 similarity is: 0.000000
the 0 and 7 similarity is: 0.990916
the 0 and 9 similarity is: 0.000000
the 1 and 4 similarity is: 0.000000
the 1 and 7 similarity is: 0.978429
the 1 and 9 similarity is: 0.000000
the 2 and 4 similarity is: 0.000000
the 2 and 7 similarity is: 0.977652
the 2 and 9 similarity is: 0.000000
the 3 and 4 similarity is: 0.000000
the 3 and 7 similarity is: 0.000000
the 3 and 9 similarity is: 1.000000
the 5 and 4 similarity is: 0.000000
the 5 and 7 similarity is: 0.000000
the 5 and 9 similarity is: 1.000000
the 6 and 4 similarity is: 1.000000
the 6 and 7 similarity is: 0.000000
the 6 and 9 similarity is: 0.692308
the 8 and 4 similarity is: 0.000000
the 8 and 7 similarity is: 0.995750
the 8 and 9 similarity is: 0.000000
the 10 and 4 similarity is: 0.000000
the 10 and 7 similarity is: 1.000000
the 10 and 9 similarity is: 1.000000
Out[133]: 
[(3, 4.0),
 (5, 4.0),
 (6, 4.0),
 (10, 2.5),
 (0, 1.0),
 (1, 1.0),
 (2, 1.0),
 (8, 1.0)]


2.     Using SVD to improve the recommendations

xformedItems = dataMat.T * U[:,:4] * Sig4.I , this is the only line the book's code uses for the transformation; it is unpacked below:
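The formulas in the original post were embedded as images and are not available in this text. What follows is a reconstruction of the standard argument rather than the author's own derivation. The notation is an assumption introduced here: A is the data matrix (dataMat), k = 4, U_k and V_k are the first k columns of U and V, and Σ_k is the top-left k×k block of Σ (Sig4 in the code).

```latex
\begin{aligned}
A &= U \Sigma V^{T}
  \quad\Longrightarrow\quad A^{T} = V \Sigma^{T} U^{T} \\[4pt]
A^{T} U_k \Sigma_k^{-1}
  &= V \Sigma^{T} \left( U^{T} U_k \right) \Sigma_k^{-1}
   = V \Sigma^{T} \begin{bmatrix} I_k \\ 0 \end{bmatrix} \Sigma_k^{-1} \\[4pt]
  &= V \begin{bmatrix} \Sigma_k \\ 0 \end{bmatrix} \Sigma_k^{-1}
   = V \begin{bmatrix} I_k \\ 0 \end{bmatrix}
   = V_k
\end{aligned}
```

The second line uses the orthonormality of the columns of U, and the whole argument requires the first k singular values to be nonzero. Under this reading, xformedItems is the first 4 columns of V, so each of its rows represents one item in a 4-dimensional space.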





Verify this formula with the data above:

from numpy import *
from numpy import linalg as la

    
def loadExData2():
    return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

my_dat = mat(loadExData2())
U,Sigma,VT = la.svd(my_dat)
Sig4 = mat(eye(4)*Sigma[:4]) #arrange Sig4 into a diagonal matrix
xformedItems = my_dat.T * U[:,:4] * Sig4.I  #create transformed items
print('xformedItems: \n',xformedItems)
###
V = VT.T
print('V[:,:4]: \n',V[:,:4])

The two results are identical, perfect!

xformedItems: 
 [[-0.45137416  0.03084799 -0.00290108  0.01189185]
 [-0.36239706  0.02584428 -0.00189127  0.01348796]
 [-0.46879252  0.03296133 -0.00281253  0.01656192]
 ..., 
 [-0.47223188  0.02853952 -0.00504059  0.00160266]
 [-0.01591788 -0.39205093  0.55707516  0.04356321]
 [-0.0552444  -0.52034959 -0.36330956 -0.19023805]]
V[:,:4]: 
 [[-0.45137416  0.03084799 -0.00290108  0.01189185]
 [-0.36239706  0.02584428 -0.00189127  0.01348796]
 [-0.46879252  0.03296133 -0.00281253  0.01656192]
 ..., 
 [-0.47223188  0.02853952 -0.00504059  0.00160266]
 [-0.01591788 -0.39205093  0.55707516  0.04356321]
 [-0.0552444  -0.52034959 -0.36330956 -0.19023805]]
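Comparing two printed matrices by eye only checks the digits that happen to be displayed. As a complementary check, the comparison can be done numerically with np.allclose. The sketch below uses plain ndarrays instead of np.matrix; the variable names `ratings`, `k`, `xformed` and `same` are chosen for this example only.

```python
import numpy as np

# Same rating matrix as loadExData2() above.
ratings = np.array([
    [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
    [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
    [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
    [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
    [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
    [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
    [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
    [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0],
], dtype=float)

k = 4
U, sigma, VT = np.linalg.svd(ratings)

# A^T * U_k * Sigma_k^{-1}, written with the @ operator for ndarrays
xformed = ratings.T @ U[:, :k] @ np.linalg.inv(np.diag(sigma[:k]))

# Compare against the first k columns of V, up to floating-point tolerance
same = np.allclose(xformed, VT.T[:, :k])
print(same)  # expected to print True
```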


Improving the recommendations with SVD

from numpy import *
from numpy import linalg as la

    
def loadExData2():
    return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

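def cosSim(inA,inB):
    # inA and inB are column vectors (numpy matrices) of equal length
    num = float((inA.T*inB)[0,0])  # dot product, taken out of the 1x1 matrix
    denom = la.norm(inA)*la.norm(inB)
    return 0.5+0.5*(num/denom)  # rescale the cosine from [-1, 1] to [0, 1]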

def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U,Sigma,VT = la.svd(dataMat)
    Sig4 = mat(eye(4)*Sigma[:4]) #arrange Sig4 into a diagonal matrix
    xformedItems = dataMat.T * U[:,:4] * Sig4.I  #create transformed items
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0 or j==item: continue
        similarity = simMeas(xformedItems[item,:].T,\
                             xformedItems[j,:].T)
        print ('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=svdEst):
    unratedItems = nonzero(dataMat[user,:].A==0)[1]#find unrated items 
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

my_dat = mat(loadExData2())
recommend(my_dat,2,N=10)
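One thing worth noticing in the script above is that svdEst runs la.svd on the whole matrix every time it is called, that is, once per unrated item. A possible restructuring is to compute the item factors once and reuse them for all items. The following is only a sketch under that assumption, written with plain ndarrays; `item_factors`, `svd_scores` and the 5x5 matrix `toy` are hypothetical names and data, not part of the book's code.

```python
import numpy as np

def item_factors(ratings, k=4):
    """Map each item to k dimensions: A^T U_k Sigma_k^{-1}."""
    U, sigma, _ = np.linalg.svd(ratings)
    # Dividing column j by sigma[j] is the same as multiplying by Sigma_k^{-1}
    return ratings.T @ U[:, :k] / sigma[:k]

def cos_sim(a, b):
    """Cosine similarity, rescaled from [-1, 1] to [0, 1]."""
    return 0.5 + 0.5 * float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def svd_scores(ratings, user, k=4):
    """Score every item the user has not rated, running the SVD only once."""
    items = item_factors(ratings, k)
    rated = np.nonzero(ratings[user])[0]
    unrated = np.nonzero(ratings[user] == 0)[0]
    scores = []
    for item in unrated:
        sims = np.array([cos_sim(items[item], items[j]) for j in rated])
        total = sims.sum()
        # Similarity-weighted average of the user's existing ratings
        score = float(sims @ ratings[user, rated] / total) if total > 0 else 0.0
        scores.append((int(item), score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: rows are users, columns are items, 0 means "not rated".
toy = np.array([[5, 4, 0, 1, 0],
                [4, 5, 1, 0, 0],
                [0, 1, 5, 4, 0],
                [1, 0, 4, 5, 3],
                [5, 5, 0, 0, 1]], dtype=float)

print(svd_scores(toy, user=0, k=2))
```

The scores should match what the per-item version computes, because the decomposition depends only on the rating matrix and not on the item being scored.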

Result:
the 0 and 4 similarity is: 0.487100
the 0 and 7 similarity is: 0.996341
the 0 and 9 similarity is: 0.490280
the 1 and 4 similarity is: 0.485583
the 1 and 7 similarity is: 0.995886
the 1 and 9 similarity is: 0.490272
the 2 and 4 similarity is: 0.485739
the 2 and 7 similarity is: 0.995963
the 2 and 9 similarity is: 0.490180
the 3 and 4 similarity is: 0.450495
the 3 and 7 similarity is: 0.482175
the 3 and 9 similarity is: 0.522379
the 5 and 4 similarity is: 0.506795
the 5 and 7 similarity is: 0.494716
the 5 and 9 similarity is: 0.496130
the 6 and 4 similarity is: 0.434401
the 6 and 7 similarity is: 0.479543
the 6 and 9 similarity is: 0.583833
the 8 and 4 similarity is: 0.490037
the 8 and 7 similarity is: 0.997067
the 8 and 9 similarity is: 0.490078
the 10 and 4 similarity is: 0.512896
the 10 and 7 similarity is: 0.524970
the 10 and 9 similarity is: 0.493617
Out[152]: 
[(6, 3.0394902391812897),
 (5, 3.0090087051508885),
 (3, 3.0058579857590897),
 (10, 2.9716430866680414),
 (8, 2.4871398120349713),
 (0, 2.48559019295775),
 (1, 2.4847617140703271),
 (2, 2.4847530913678271)]

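These results differ somewhat from the ones obtained without SVD. The main reason is that this dataset has few users, so the intersection of two item columns contains few nonzero entries and the similarities are computed from very little data.


That's a wrap!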


--------------------------------------------------------------------------------------
(Entering formulas requires LaTeX, which is cumbersome, so the two parts above use images instead. Apologies.)



