Programming Collective Intelligence: Reading Notes 1

伶仃独步

Article link: https://blog.csdn.net/u013319133/article/details/51000578

First, create a new file named recommendations.py and put the following dataset in it:

    critics = {
        'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                      'Just My Luck': 3.0, 'Superman Returns': 3.5,
                      'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
        'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                         'Just My Luck': 1.5, 'Superman Returns': 5.0,
                         'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
        'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                             'Superman Returns': 3.5, 'The Night Listener': 4.0},
        'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                         'The Night Listener': 4.5, 'Superman Returns': 4.0,
                         'You, Me and Dupree': 2.5},
        'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                         'Just My Luck': 2.0, 'Superman Returns': 3.0,
                         'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
        'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                          'The Night Listener': 3.0, 'Superman Returns': 5.0,
                          'You, Me and Dupree': 3.5},
        'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
                 'Superman Returns': 4.0},
    }

A function that computes the similarity between two users, using the Euclidean distance score:

    from math import sqrt

    # Returns a distance-based similarity score for person1 and person2
    def sim_distance(prefs, person1, person2):
        # Get the list of shared items
        si = {}
        for item in prefs[person1]:
            if item in prefs[person2]:
                si[item] = 1
        # If they have no ratings in common, return 0
        if len(si) == 0:
            return 0
        # Add up the squares of all the differences
        sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                              for item in si])
        return 1 / (1 + sum_of_squares)
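
As a worked check of the distance score, the sketch below recomputes it step by step for Lisa Rose and Gene Seymour, using their ratings from the dataset above. The variable names (`lisa`, `gene`, `shared`, `squared_diffs`) are illustrative only and are not part of recommendations.py.

```python
# Ratings copied from the critics dataset for two critics.
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
        'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}

# Movies rated by both critics.
shared = [movie for movie in lisa if movie in gene]

# Squared rating difference for each shared movie.
squared_diffs = {movie: (lisa[movie] - gene[movie]) ** 2 for movie in shared}
sum_of_squares = sum(squared_diffs.values())  # 0.25 + 0 + 2.25 + 2.25 + 1.0 + 0 = 5.75

# Map the sum into (0, 1]: identical ratings give 1, larger gaps approach 0.
score = 1 / (1 + sum_of_squares)
print(round(score, 4))  # 0.1481
```

The two critics agree exactly on two movies, but the 1.5-point gaps on 'Just My Luck' and 'Superman Returns' dominate the sum and pull the score down.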

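A second similarity function uses the Pearson correlation coefficient. The Pearson correlation coefficient measures how well two data sets fit on a straight line. Its formula is more complicated than the Euclidean distance, but it tends to give better results when the data is not well normalized, for example when users' scores vary widely in scale. As an example, Mick LaSalle rated Superman Returns 3 and Gene Seymour rated it 5, which gives the point (3, 5) in a two-dimensional coordinate system.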

Figure: scatter plot of the two critics' ratings.

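In this chart you can see a straight line fitted through the points. Interestingly, the Pearson score corrects for differences in rating scale. When one user tends to give high scores and another tends to give low ones, the points still fit a straight line as long as the difference between their scores is consistent, so the Pearson score stays high. The Euclidean distance score would conclude that these two users are dissimilar, even though their tastes are actually very close. Whether this behavior is desirable depends on the application.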
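
To make the scale-correction point concrete, here is a minimal sketch with two hypothetical critics (`strict` and `generous`, invented for this illustration) whose ratings differ by exactly one point on every movie. The helper names `euclidean_score` and `pearson` are also assumptions of this sketch; `pearson` uses the textbook mean-centered definition rather than the single-pass form used in these notes.

```python
from math import sqrt

def euclidean_score(a, b):
    """Distance-based similarity: 1 / (1 + sum of squared differences)."""
    return 1 / (1 + sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation from the mean-centered definition."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    den = sqrt(var_a * var_b)
    return cov / den if den else 0

# Hypothetical ratings for the same three movies: the generous critic
# always scores exactly one point higher than the strict one.
strict = [2.0, 3.0, 4.0]
generous = [3.0, 4.0, 5.0]

print(euclidean_score(strict, generous))  # 0.25
print(pearson(strict, generous))          # 1.0
```

The constant one-point offset costs the distance-based score most of its value, while the correlation is unaffected because both critics rank and space the movies identically.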

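To compute the Pearson score, first find the items that both users have rated. Then compute the sum and the sum of squares of each user's ratings for those items, along with the sum of the products of the two users' ratings. Finally, combine these values into the Pearson correlation coefficient.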
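
Written as a formula, with $x_i$ and $y_i$ denoting the two users' ratings of the $i$-th shared item and $n$ the number of shared items (this notation is introduced here for illustration and does not appear in the original notes), these quantities combine as:

$$
r = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}
         {\sqrt{\left(\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right)
                \left(\sum y_i^2 - \dfrac{\left(\sum y_i\right)^2}{n}\right)}}
$$

In the function that follows, `pSum` corresponds to $\sum x_i y_i$, `sum1` and `sum2` to $\sum x_i$ and $\sum y_i$, and `sum1Sq` and `sum2Sq` to $\sum x_i^2$ and $\sum y_i^2$.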

    # Returns the Pearson correlation coefficient for p1 and p2
    def sim_pearson(prefs, p1, p2):
        # Get the list of mutually rated items
        si = {}
        for item in prefs[p1]:
            if item in prefs[p2]:
                si[item] = 1
        # Find the number of elements
        n = len(si)
        # If there are no ratings in common, return 0
        if n == 0:
            return 0
        # Add up all the preferences
        sum1 = sum([prefs[p1][it] for it in si])
        sum2 = sum([prefs[p2][it] for it in si])
        # Sum up the squares
        sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
        sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
        # Sum up the products
        pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
        # Calculate the Pearson score
        num = pSum - (sum1 * sum2 / n)
        den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
        # A zero denominator means at least one user gave identical ratings
        # to every shared item, so the correlation is undefined
        if den == 0:
            return 0
        r = num / den
        return r

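This function returns a value between -1 and 1. A value of 1 means the two users' ratings are perfectly positively correlated, and -1 means they are perfectly negatively correlated.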
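
As a sanity check on the single-pass computation in `sim_pearson`, the sketch below compares it with the mean-centered definition of the correlation for Lisa Rose and Gene Seymour. The two lists hold their ratings from the dataset in the same movie order; the variable names are illustrative and the mean-centered form is the standard textbook definition, not something taken from these notes.

```python
from math import sqrt

# Ratings of the six movies both critics rated, in the same movie order
# (values copied from the critics dataset).
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]
n = len(lisa)

# Single-pass form, as in sim_pearson: sums, sums of squares, sum of products.
sum1, sum2 = sum(lisa), sum(gene)
sum1_sq = sum(x * x for x in lisa)
sum2_sq = sum(y * y for y in gene)
p_sum = sum(x * y for x, y in zip(lisa, gene))
r_single_pass = (p_sum - sum1 * sum2 / n) / sqrt(
    (sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))

# Mean-centered form: co-deviation over the product of the deviations.
mean1, mean2 = sum1 / n, sum2 / n
cov = sum((x - mean1) * (y - mean2) for x, y in zip(lisa, gene))
r_centered = cov / sqrt(sum((x - mean1) ** 2 for x in lisa) *
                        sum((y - mean2) ** 2 for y in gene))

print(round(r_single_pass, 4), round(r_centered, 4))  # 0.3961 0.3961
```

Both forms agree, and the result of about 0.396 falls inside the -1 to 1 range as expected.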

Using a similarity function, compute how similar every other user is to a given user, sort the results, and return the top n users:

    # Returns the best matches for person from the prefs dictionary.
    # Number of results and similarity function are optional params.
    def topMatches(prefs, person, n=5, similarity=sim_pearson):
        scores = [(similarity(prefs, person, other), other)
                  for other in prefs if other != person]
        # Sort the list so the highest scores appear at the top
        scores.sort(reverse=True)
        return scores[0:n]
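
A usage sketch for the ranking step, finding the three critics most similar to Toby. To keep it self-contained, it uses a trimmed copy of the dataset (named `ratings` here) containing only the three movies Toby rated; since the Pearson score looks only at shared items, Toby's scores should match those from the full dataset. The names `pearson` and `top_matches` are compact stand-ins written for this sketch, not the functions from recommendations.py.

```python
from math import sqrt

# Trimmed copy of the critics dataset: only the movies Toby rated.
ratings = {
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5},
    'Gene Seymour': {'Snakes on a Plane': 3.5, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
    'Michael Phillips': {'Snakes on a Plane': 3.0, 'Superman Returns': 3.5},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 2.5},
    'Mick LaSalle': {'Snakes on a Plane': 4.0, 'Superman Returns': 3.0, 'You, Me and Dupree': 2.0},
    'Jack Matthews': {'Snakes on a Plane': 4.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0},
}

def pearson(prefs, p1, p2):
    """Pearson correlation over the items both users rated."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    x = [prefs[p1][item] for item in shared]
    y = [prefs[p2][item] for item in shared]
    num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
               (sum(b * b for b in y) - sum(y) ** 2 / n))
    return num / den if den else 0

def top_matches(prefs, person, n=5, similarity=pearson):
    """Return the n (score, name) pairs most similar to person."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)  # highest similarity first
    return scores[:n]

for score, name in top_matches(ratings, 'Toby', n=3):
    print(name, round(score, 3))
# Lisa Rose 0.991
# Mick LaSalle 0.924
# Claudia Puig 0.893
```

Because the similarity function is passed as a parameter, the same ranking code can be reused with a distance-based score simply by supplying a different function.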