Programming Collective Intelligence, Chapter 2 (Part 1)

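I. Computing User Similarity
1. Euclidean distance
For the convenience of future readers, all of the code (written for Python 2.6) is collected at the end of this post.
There is not much to explain here: in two-dimensional space, the Euclidean distance is simply the length of the line segment between two points. In n-dimensional space, for two points A(x1, x2, x3, …, xn) and B(y1, y2, y3, …, yn), the Euclidean distance is
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Corresponding code (this computes the sum of squared differences under the square root, taken over the items both people have rated):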

sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2) for item in prefs[person1] if item in prefs[person2]])

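The distance is then normalized into a similarity score: 1 divided by (distance + 1). Adding 1 keeps the denominator from being zero, so the score falls in (0, 1], and two people with identical ratings score exactly 1.
Corresponding code: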

return 1/(1 + sqrt(sum_of_squares))
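
As a quick check of the two snippets above, here is a minimal self-contained sketch that applies them to two critics from the dataset in the appendix. The function name `euclidean_similarity` and the trimmed-down `prefs` dictionary are chosen here for illustration only; the logic is the same sum of squares followed by 1/(1 + distance).

```python
from math import sqrt

# Ratings for two critics, copied from the critics dict in the appendix.
prefs = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
}

def euclidean_similarity(prefs, person1, person2):
    """Illustrative helper (the name is an assumption): 1 / (1 + distance)."""
    # Only items rated by both people contribute to the distance.
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum(pow(prefs[person1][item] - prefs[person2][item], 2)
                         for item in shared)
    return 1 / (1 + sqrt(sum_of_squares))

score = euclidean_similarity(prefs, 'Lisa Rose', 'Gene Seymour')
```

For this pair the squared differences sum to 5.75, so the score is 1/(1 + sqrt(5.75)), roughly 0.294.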

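2. Pearson correlation coefficient
Euclidean distance uses items as the axes and measures the distance between people. The Pearson correlation coefficient instead uses people as the axes and measures how similar two people are by how closely their ratings of the same items track each other. The figures in the book make this easy to see.
The Pearson correlation coefficient is computed as:
$$r = \frac{\sum_{i \in SI} x_i y_i - \frac{\left(\sum_{i \in SI} x_i\right)\left(\sum_{i \in SI} y_i\right)}{N}}{\sqrt{\left(\sum_{i \in SI} x_i^2 - \frac{\left(\sum_{i \in SI} x_i\right)^2}{N}\right)\left(\sum_{i \in SI} y_i^2 - \frac{\left(\sum_{i \in SI} y_i\right)^2}{N}\right)}}$$
where x_i and y_i are the two people's ratings of item i, SI = X ∩ Y is the set of items rated by both people, and N = len(SI).
It looks bulky, but anyone who has studied probability theory will recognize it as the familiar
$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
Corresponding code: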

    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    #calculate the Pearson Correlation Score
    num = pSum - (sum1 * sum2/n)
    den = sqrt((sum1Sq - pow(sum1, 2)/n)*(sum2Sq - pow(sum2, 2)/n))
    if den == 0: return 0
    r = num / den
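
To make the link between the sum-based formula and the covariance form concrete, the following sketch computes the coefficient both ways for the six movies that Lisa Rose and Gene Seymour both rated (values taken from the appendix dataset). The variable names are chosen here for illustration, and the covariance and standard deviations use the population (divide by n) convention, which is an assumption of this sketch.

```python
from math import sqrt

# Ratings of the six shared movies, in the same movie order for both critics.
x = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]  # Lisa Rose
y = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]  # Gene Seymour
n = len(x)

# Sum-based form, as in the snippet above.
sum_x, sum_y = sum(x), sum(y)
sum_x_sq = sum(v * v for v in x)
sum_y_sq = sum(v * v for v in y)
p_sum = sum(a * b for a, b in zip(x, y))
num = p_sum - sum_x * sum_y / n
den = sqrt((sum_x_sq - sum_x ** 2 / n) * (sum_y_sq - sum_y ** 2 / n))
r_sums = num / den

# Definition form: covariance divided by the product of standard deviations.
mean_x, mean_y = sum_x / n, sum_y / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
r_def = cov / (std_x * std_y)
```

Both `r_sums` and `r_def` come out to roughly 0.396; the sum-based form simply avoids a separate pass to compute the means.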

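II. Recommendations
1. Finding the users most similar to a given user (what the book calls ranking the critics)
There is not much to say here either: the similarity scores have already been computed above, so all that remains is to sort them. See the code at the end.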
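
The sorting step can be sketched as follows. The similarity scores are made-up numbers for illustration, not results from the book's dataset, and `top_n` is a hypothetical helper name.

```python
def top_n(scores, n=3):
    """Return the n (similarity, name) pairs with the highest similarity."""
    # Tuples compare element by element, so the similarity decides the order.
    return sorted(scores, reverse=True)[:n]

# Hypothetical (similarity, name) pairs; the values are invented.
scores = [(0.38, 'Ann'), (0.99, 'Bob'), (-0.20, 'Cid'), (0.66, 'Dee')]
best = top_n(scores, n=2)
```

Putting the similarity before the name in each tuple is what lets a plain sort rank people by similarity.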
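2. Recommending items to a user
This breaks down into two steps: estimate the score the user would give each item, then sort by the estimated scores and output the top few results.
The key is the first step, estimating the user's score for an item, which introduces the idea of weights. The more similar another user is to the target user, the more weight that user's ratings carry; in other words, we use the similarity itself as each user's weight. The estimated score of user u for item i can then be summarized roughly as:
$$\text{score}(u, i) = \frac{\sum_{v} \text{sim}(u, v) \cdot r_{v,i}}{\sum_{v} \text{sim}(u, v)}$$
where the sums run over the other users v who have rated item i, and r_{v,i} is v's rating of item i.
Table 2-2 in the book shows this calculation and its results (I will not reproduce the figure here). The corresponding code is: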

                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                #sum of similarity
                simSums.setdefault(item, 0)
                simSums[item] += sim
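
A tiny worked example of this weighted average for a single movie may help. The similarities and ratings below are invented for illustration; skipping non-positive similarities mirrors the `sim <= 0` check in `getRecommendations` in the appendix.

```python
# Hypothetical similarities to the target user and ratings of one movie.
sims = {'Ann': 0.9, 'Bob': 0.5, 'Cid': -0.3}
ratings = {'Ann': 3.0, 'Bob': 4.0, 'Cid': 5.0}

total = 0.0    # sum of similarity * rating
sim_sum = 0.0  # sum of similarities, used to normalize
for other, sim in sims.items():
    # Users with zero or negative similarity do not contribute.
    if sim <= 0:
        continue
    total += ratings[other] * sim
    sim_sum += sim

predicted = total / sim_sum
```

Here the prediction is (0.9 * 3.0 + 0.5 * 4.0) / (0.9 + 0.5), about 3.36, which sits closer to Ann's rating than the plain average of 3.5 because Ann is the more similar user.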

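With these values computed, the items can be sorted by their estimated scores.
Appendix: the code in recommendations.py so far is as follows
Please bear with my comments. My English is not good and I am trying to use it more, so feel free to point out any grammar mistakes. Let's encourage each other!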

#create a dict about movies
critics = {'Lisa Rose':{'Lady in the Water':2.5, 'Snakes on a Plane':3.5, 'Just My Luck':3.0, 'Superman Returns':3.5, 'You, Me and Dupree':2.5, 'The Night Listener':3.0},
       'Gene Seymour':{'Lady in the Water':3.0, 'Snakes on a Plane':3.5, 'Just My Luck':1.5, 'Superman Returns':5.0, 'The Night Listener':3.0, 'You, Me and Dupree':3.5},
       'Michael Phillips':{'Lady in the Water':2.5, 'Snakes on a Plane':3.0, 'Superman Returns':3.5, 'The Night Listener':4.0},
       'Claudia Puig':{'Snakes on a Plane':3.5, 'Just My Luck':3.0, 'The Night Listener':4.5, 'Superman Returns':4.0, 'You, Me and Dupree':2.5},
       'Mick LaSalle':{'Lady in the Water':3.0, 'Snakes on a Plane':4.0, 'Just My Luck':2.0, 'Superman Returns':3.0, 'The Night Listener':3.0, 'You, Me and Dupree':2.0},
       'Jack Matthews':{'Lady in the Water':3.0, 'Snakes on a Plane':4.0, 'The Night Listener':3.0, 'Superman Returns':5.0, 'You, Me and Dupree':3.5},
       'Toby':{'Snakes on a Plane':4.5, 'You, Me and Dupree':1.0, 'Superman Returns':4.0}}

from math import sqrt
#return a value to judge similarity between person1 and person2
def sim_distance(prefs, person1, person2):
    #items of both person1 and person2
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    #return 0 if person1 and person2 have no items in common
    if len(si) == 0: return 0
    #calculate the distance between person1 and person2
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2) for item in prefs[person1] if item in prefs[person2]])
    return 1/(1 + sqrt(sum_of_squares))

#return the Pearson Correlation Score between p1 and p2
def sim_pearson(prefs, p1, p2):
    #items of both p1 and p2
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    #number of items
    n = len(si)
    #return 0 if p1 and p2 have no items in common
    if n == 0: return 0

    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    #calculate the Pearson Correlation Score
    num = pSum - (sum1 * sum2/n)
    den = sqrt((sum1Sq - pow(sum1, 2)/n)*(sum2Sq - pow(sum2, 2)/n))
    if den == 0: return 0
    r = num / den
    return r

#return the n people most similar to person
def topMatches(prefs, person, n = 5, similarity = sim_pearson):
    scores = [(similarity(prefs, person, other), other) for other in prefs if other != person]
    #sort the list so the highest scores come first
    scores.sort()
    scores.reverse()
    return scores[0:n]

#recommend items using a similarity-weighted average of other people's ratings
def getRecommendations(prefs, person, similarity = sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        #do not match with itself
        if other == person: continue
        sim = similarity(prefs, person, other)
        #ignore similarities that are zero or negative
        if sim <= 0: continue
        for item in prefs[other]:
            #only score movies this person has not rated yet
            if item not in prefs[person] or prefs[person][item] == 0:
                #similarity * values
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                #sum of similarity
                simSums.setdefault(item, 0)
                simSums[item] += sim
    #create a normalized list
    rankings = [(total/simSums[item], item) for item, total in totals.items()]
    #sort so the highest estimated scores come first
    rankings.sort()
    rankings.reverse()
    return rankings