集体智慧编程(一) Making Recommendations

协同过滤(Collaborative Filtering)是推荐系统常用的算法。它包括User-based和Item-based两种基本算法。

1.基于用户(User-based)

基于用户的CF主要流程是:寻找相似的用户——推荐物品。寻找相似用户时,书中提供了欧几里德距离(Euclidean Distance)和皮尔逊相关系数(Pearson Correlation)两种度量。

1.1先在文件中存一些用户爱好数据。

critics = {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
           'Gene Seymour':{'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener':3.0, 'You, Me and Dupree': 3.5},
           'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0, 'Superman Returns': 3.5, 'The Night Listener':4.0}, 
           'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'Superman Returns': 4.0, 'The Night Listener':4.5, 'You, Me and Dupree': 2.5},
           'Mick LaSalle':{'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener':3.0, 'You, Me and Dupree': 2.0}, 
           'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,'The Night Listener':3.0, 'Superman Returns':5.0, 'You, Me and Dupree': 3.5},
           'Toby':{ 'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0}}

1.2寻找相似用户

1.2.1欧几里德距离(Euclidean Distance)

欧几里德空间中,两点x,y的距离计算公式:


def sim_distance(prefs, person1, person2):
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    if len(si) == 0:
        return 0
    sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si])
    return 1/(1+sum_of_squares)

1.2.2皮尔逊相关系数(Pearson Correlation)

计算公式:


def sim_pearson(prefs, person1, person2):
    s = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            s[item] = 1
    if len(s) == 0: return 0
    
    sum1 = sum([prefs[person1][it] for it in s])
    sum2 = sum([prefs[person2][it] for it in s])
    
    sum1sq = sum([pow(prefs[person1][it], 2) for it in s])
    sum2sq = sum([pow(prefs[person2][it], 2) for it in s])
    
    sumofprod = sum([prefs[person1][it]*prefs[person2][it] for it in s])
    
    num = sumofprod - (sum1*sum2/len(s))
    den = sqrt((sum1sq - pow(sum1,2)/len(s))*(sum2sq - pow(sum2,2)/len(s)))
    
    if den == 0:
        return 0
    
    r = num/den
    return r

定义了欧几里德距离和皮尔逊相关两种度量后,可以用这两种度量来寻找相似的用户。

def topMatches(prefs, person, n=5, similarity=sim_pearson):
    scores = [(similarity(prefs, person, other), other) for other in prefs if other!=person]
    scores.sort()
    scores.reverse()
    return scores[0:n]

1.3推荐物品

推荐图品时,用(用户相似系数×物品评分)之和/(用户相似系数之和)。

比如,为Alice推荐电影。Bob与Alice的相似系数是0.6,对电影《Looper》的评分为3;Wendy与Alice的相似系数为0.8,对电影《Looper》的评分为4。则为Alice推荐电影《Looper》的指数为:(0.6*3+0.8*4)/(0.6+0.8)

def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim<=0:continue
        for item in prefs[other]:
            if item not in prefs[person] or prefs[person][item] == 0:
                totals.setdefault(item,0)
                totals[item] += prefs[other][item]*sim
                simSums.setdefault(item,0)
                simSums[item] += sim
    rankings = [(total/simSums[item],item) for item,total in totals.items()]
    rankings.sort()
    rankings.reverse()
    return rankings

2.基于物品(Item-based)

基于用户的CF有它的局限性,比如:用户数量大的时候,计算用户之间的相似系数,然后又计算用户评价过的物品会比较慢。此外,物品数量很多的情况下,两个用户的交集就会很小,甚至完全没交集,数据集是稀疏的。为了解决这样的问题,可以采用基于物品的CF。对于像亚马逊这样的网站来说,销售的商品是相对比较固定的,可以预先计算物品之间的相似系数,然后根据用户的之前买过的物品进行推荐。

2.1建立物品比较数据集(Building the item comparison dataset)

#将数据集由用户主导转为物品主导
def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item,{})
            result[item][person] = prefs[person][item]
    return result

#构建相似物品数据集,默认为前10个相似物品
def calculateSimilarItems(prefs, n=10):
    result = {}
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        c += 1
        if c % 100 == 0:
            print "%4d / %4d" % (c,len(itemPrefs))
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result

2.2推荐物品

def getRecommendedItems(prefs, itemMatch, user):
    simSums = {}
    recommend = {}
    for filmwatched in prefs[user]:
        for sim,film in itemMatch[filmwatched]:
            if film in prefs[user]:
                continue
            recommend.setdefault(film,0)
            recommend[film] += prefs[user][filmwatched] * sim
            simSums.setdefault(film,0)
            simSums[film] += sim
    result = [(recommend[item]/simSums[item],item) for item in recommend]
    result.sort()
    result.reverse()
    return result

这样就可以基于物品进行推荐:

>>> import recommendations
>>> prefs = recommendations.calculateSimilarItems(recommendations.critics)
>>> recommendations.getRecommendedItems(recommendations.critics, prefs, 'Toby')
[(3.182634730538922, 'The Night Listener'), (2.5983318700614575, 'Just My Luck'), (2.4730878186968837, 'Lady in the Water')]

3.一个实例——使用MovieLens Dataset

最后用一个电影评分数据集来做一个比较有实际意义的推荐。数据集可以在这里下载MovieLens数据集 里面有电影评分,电影信息,用户信息等各种东西,Readme很详细第介绍了各个文件的内容格式。现在就可以从文件中构建数据库,也就是将数据存成类似上面的critics的形式,{用户:{电影1:评分,电影2:评分}}

def loadMoviLens(path=r'存放的路径\Dataset\ml-100k'):
    film = {}
    filminfo = open(path + r'\u.item')
    for line in filminfo.readlines():
        (id, title) = line.strip('\n').split('|')[0:2]
        film[id] = title
    filminfo.close()
    
    prefs = {}
    ratedata = open(path + r'\u.data')
    for line in ratedata.readlines():
        (user, filmid, rating, ts) = line.strip('\n').split('\t')
        prefs.setdefault(user,{})
        prefs[user][film[filmid]] = float(rating)
    return prefs
建好数据库后,可以把它存起来,方便以后用,不用每次都loadMovieLens

import pickle
import recommendations as recmd
try:
    prefs = recmd.loadMovieLens()
    filmsim = recmd.calculateSimilarItems(prefs, n=50)
    f = open(r'你存放的数据的路径','wb')
    pickle.dump(filmsim, f)
except:
    print 'Something wrong!'
else:
    print 'Done building dataset!'
finally:
    f.close()
然后就可以测试一下了:

>>> from pprint import pprint
>>> import recommendations
>>> prefs = recommendations.loadMoviesLen()
>>> pprint(prefs['87'])
{'2001: A Space Odyssey (1968)': 5.0,
 'Ace Ventura: Pet Detective (1994)': 4.0,
 'Addams Family Values (1993)': 2.0,
 'Addicted to Love (1997)': 4.0,
.....}
>>> pprint(recommendations.getRecommendations(prefs,'87')[0:10])
[(5.0, 'They Made Me a Criminal (1939)'),
 (5.0, 'Star Kid (1997)'),
 (5.0, 'Santa with Muscles (1996)'),
 (5.0, 'Saint of Fort Washington, The (1993)'),
 (5.0, 'Marlene Dietrich: Shadow and Light (1996) '),
 (5.0, 'Great Day in Harlem, A (1994)'),
 (5.0, 'Entertaining Angels: The Dorothy Day Story (1996)'),
 (5.0, 'Boys, Les (1997)'),
 (4.89884443128923, 'Legal Deceit (1997)'),
 (4.815019082242709, 'Letter From Death Row, A (1998)')]

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值