Building a Movie Recommendation System with MapReduce Using Python MrJob

Latest recommended article published on 2024-08-08 17:04:25

yinyao1992

Original link: http://www.sobuhu.com/archives/567

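I recently came across a fun Python library that makes it easy to write MapReduce jobs in Python and run them on Hadoop through Hadoop Streaming. For typical Hadoop jobs dominated by IO (database queries, file reads and writes, and so on), it makes little performance difference whether you write them in Python, Java, or C++. For compute-heavy jobs, Python can be much slower (slow at the language level); see the Stack Overflow discussion linked below.

The most common tasks, such as log analysis and query statistics, can be done quickly in Python.

Python is a rapid-development language whose clean, concise syntax has won over many people. Many machine-learning programs start out in Python (Zhihu's recommendation engine, for example) and are moved to C or Java only once they reach a certain scale.

This article uses a simple movie recommendation system to show how to use MrJob.

First, since many readers care a great deal about performance, this discussion is worth reading beforehand:

http://stackoverflow.com/questions/1482282/java-vs-python-on-hadoop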

MrJob project page: https://github.com/Yelp/mrjob

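A Brief Introduction to MrJob

The focus here is on implementing the movie recommendation system, so the introduction to MrJob itself is kept short, just enough to get by; see the official documentation for details.

Once mrjob is installed in Python, the most basic MapReduce job is very simple: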

 
```python
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
```

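The code above has three functions, mapper, combiner, and reducer, which play the same roles as in the usual Java version:

mapper receives the input one line at a time, processes it, and emits key-value pairs;
combiner takes the key-value pairs emitted by a mapper and merges them: the values that share a key are passed in together, processed, and emitted again;
reducer does the same kind of work as combiner. The difference is that a combiner only sees the output of a single mapper, while the reducer handles the key-value pairs of the whole job (which may involve many mappers) and takes the combiners' output as its input.

For more detail, such as multi-step jobs and data initialization, see the official documentation.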
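To make the data flow between the three functions concrete without a Hadoop cluster, here is a minimal sketch that simulates the phases in plain Python. It does not use mrjob at all; `group_and_sum`, `run_word_count`, and `splits` are hypothetical names, and each inner list of `splits` stands in for the input handled by one mapper task.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def group_and_sum(pairs):
    """Group values by key and sum them; serves as both combiner and reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def run_word_count(splits):
    combined = []
    for split in splits:
        # Map phase: one mapper task processes one split.
        mapped = [pair for line in split for pair in mapper(line)]
        # Combine phase: pre-aggregate the output of this mapper only.
        combined.extend(group_and_sum(mapped))
    # Reduce phase: aggregate across the output of all combiners.
    return dict(group_and_sum(combined))


# Hypothetical input: two splits, as if two mapper tasks were running.
splits = [["the cat sat", "the cat"], ["The dog"]]
counts = run_word_count(splits)
print(counts)
```

Because summing is associative, the same function can serve as combiner and reducer, which is why the two methods in the word count job have identical bodies.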

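Movie Recommendation System

Suppose we run a movie website where every user can rate a movie from 1 to 5. We want to compute the similarity between every pair of movies. The procedure is:

For any two movies A and B, find all the users who rated both A and B;
From those users' ratings, build one vector for movie A and one for movie B;
Compute the similarity between the two vectors;
When a user has watched a movie, recommend the other movie most similar to it.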
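The first two steps can be sketched on a toy data set before bringing in MapReduce. This is only an illustration: the ratings below are made up, and `co_rating_vectors` is a hypothetical helper, not part of the job code that follows.

```python
# Made-up (user, movie, rating) triples for illustration.
ratings = [
    ("u1", "A", 5.0), ("u1", "B", 4.0),
    ("u2", "A", 3.0), ("u2", "B", 2.0), ("u2", "C", 4.0),
    ("u3", "A", 1.0),
    ("u4", "B", 5.0), ("u4", "C", 1.0),
]


def co_rating_vectors(ratings, movie_a, movie_b):
    """Return the rating vectors of two movies over the users who rated both."""
    by_user = {}
    for user, movie, rating in ratings:
        by_user.setdefault(user, {})[movie] = rating

    vec_a, vec_b = [], []
    for user in sorted(by_user):
        rated = by_user[user]
        # Only users who rated both movies contribute a component.
        if movie_a in rated and movie_b in rated:
            vec_a.append(rated[movie_a])
            vec_b.append(rated[movie_b])
    return vec_a, vec_b


vec_a, vec_b = co_rating_vectors(ratings, "A", "B")
print(vec_a, vec_b)
```

Here only u1 and u2 rated both A and B, so each vector has two components; u3 and u4 are ignored for this pair.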

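Open movie-rating data is available for download (the file names u.data and u.item used below are those of the MovieLens 100K dataset). We use a set of 100,000 ratings of about 1,700 movies by about 1,000 users. The downloaded folder contains a README that describes each file in detail. Since we only need the (user|movie|rating) data, we preprocess the files with Python: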

 
```python
#!/usr/bin/env python

if __name__ == '__main__':
    user_items = []
    items = []
    # u.data is tab-separated: user id, item id, rating, timestamp
    with open('u.data') as f:
        for line in f:
            user_items.append(line.split('\t'))

    # u.item is pipe-separated: item id, title, ...
    # latin-1 can decode any byte, so accented titles do not raise an error
    with open('u.item', encoding='latin-1') as f:
        for line in f:
            items.append(line.split('|'))

    print('user_items[0] = ', user_items[0])
    print('items[0] = ', items[0])

    # Map item id -> movie title
    items_hash = {}
    for i in items:
        items_hash[i[0]] = i[1]

    print('items_hash[1] = ', items_hash['1'])

    # Replace each item id with its title
    for ui in user_items:
        ui[1] = items_hash[ui[1]]

    print('user_items[0] = ', user_items[0])

    # Write one user|movie|rating line per rating
    with open('ratings.csv', 'w', encoding='utf-8') as f:
        for ui in user_items:
            f.write(ui[0] + '|' + ui[1] + '|' + ui[2] + '\n')
```

The processed data looks roughly like this:

 
```
196|Kolya (1996)|3
186|L.A. Confidential (1997)|3
22|Heavyweights (1994)|1
244|Legends of the Fall (1994)|2
166|Jackie Brown (1997)|1
298|Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)|4
115|Hunt for Red October, The (1990)|2
253|Jungle Book, The (1994)|5
305|Grease (1978)|3
```

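Pearson Correlation Coefficient

There are many ways to measure the similarity of two vectors, such as Euclidean distance or Hamming distance. Here we use the Pearson correlation coefficient. It can be understood as the cosine of the angle between the two vectors after each has been centered on its mean. It lies between -1 and 1, and the larger its absolute value, the stronger the correlation. The formula is: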

$Corr(X,Y) = \frac{n \sum xy - \sum x \sum y} { \sqrt{n \sum x^{2} - (\sum x)^2 } \sqrt{n\sum y^2 - (\sum y)^2 } }$
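As a quick check on the formula, the sketch below computes it from the running sums and compares the result with the cosine of the mean-centered vectors. The function names and the sample vectors are made up for illustration; returning 0.0 when the denominator is zero is a convention assumed here, since the formula itself is undefined in that case.

```python
from math import sqrt


def pearson_from_sums(xs, ys):
    """Pearson correlation computed from n, sum(x), sum(y), sum(xy), sum(x^2), sum(y^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    # Assumed convention: treat the undefined 0/0 case as "no correlation".
    return numerator / denominator if denominator else 0.0


def centered_cosine(xs, ys):
    """Cosine of the angle between the two vectors after subtracting their means."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    dot = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return dot / norm if norm else 0.0


# Made-up ratings of two movies by the same five users.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
r_sums = pearson_from_sums(xs, ys)
r_cos = centered_cosine(xs, ys)
print(r_sums, r_cos)  # both are approximately 0.8
```

The sum-based form matters for MapReduce because each of the six sums can be accumulated one record at a time, without holding the full vectors in memory.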

In step one, we aggregate all the ratings of each user. The code is as follows:

 
```python
#!/usr/bin/env python
# coding=utf-8
from mrjob.job import MRJob
from mrjob.step import MRStep


class Step1(MRJob):
    """
    Step one aggregates all the rating data of each individual user.
    Format: user_id, (item_count, rating_sum, [(item_id, rating), ...])
    """

    def group_by_user_rating(self, key, line):
        """
        This mapper emits:
        17 70,3
        35 21,1
        49 19,2
        49 21,1
        """
        user_id, item_id, rating = line.split('|')
        yield user_id, (item_id, float(rating))

    def count_ratings_users_freq(self, user_id, values):
        """
        This reducer emits:
        49 (3,7,[19,2 21,1 70,4])
        """
        item_count = 0
        item_sum = 0
        final = []
        for item_id, rating in values:
            item_count += 1
            item_sum += rating
            final.append((item_id, rating))

        yield user_id, (item_count, item_sum, final)

    def steps(self):
        # MRStep declares one map/reduce step; older mrjob releases
        # used self.mr(...) for the same purpose.
        return [MRStep(mapper=self.group_by_user_rating,
                       reducer=self.count_ratings_users_freq)]


if __name__ == '__main__':
    Step1.run()
```

Run $ python step1.py ratings.csv > result1.csv to get the result of step one.
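Step two reads result1.csv back in as plain text, so it helps to see how one line of it can be decoded. The sketch below assumes the job ran with mrjob's default output protocol, which writes the key and the value as JSON separated by a tab. The sample line is made up to match the shape step one emits, and `parse_output_line` is a hypothetical helper.

```python
import json


def parse_output_line(line):
    """Split one 'key<TAB>value' line and JSON-decode both halves."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


# Made-up sample line in the shape step one would emit for user "49".
sample = '"49"\t[3, 7.0, [["19", 2.0], ["21", 1.0], ["70", 4.0]]]\n'
user_id, (item_count, rating_sum, user_ratings) = parse_output_line(sample)
print(user_id, item_count, rating_sum, user_ratings)
```

Tuples yielded by the reducer come back as JSON lists, so the (item_id, rating) pairs are two-element lists after decoding.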

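In step two, we take the per-user ratings aggregated in step one and compute the Pearson correlation between every pair of movies. The code and comments are as follows: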

 
```python
#!/usr/bin/env python
# coding=utf-8
import json
from itertools import combinations
from math import sqrt

from mrjob.job import MRJob
from mrjob.step import MRStep


class Step2(MRJob):

    def pairwise_items(self, user_id, values):
        '''
        This mapper takes the output of step one as input and discards user_id.
        It emits (item_1, item_2), (rating_1, rating_2).
        combinations(iterable, number) yields the combinations of a collection;
        combinations([1, 2, 3, 4], 2) yields every pair of numbers in it.
        This mapper is the performance bottleneck of the whole job, because
        combinations generates a lot of data and writing that many small
        records to disk one by one means very frequent IO. A Combiner could
        run right after the mapper to pre-aggregate the data (like the
        Reducer) and write it back to disk; that Combiner could also be
        implemented in a C library called from Python.
        '''
        # Step one ran separately and dumped its output to result1.csv, so
        # each record arrives here as a raw "key<TAB>value" text line and has
        # to be parsed. If step one and step two ran inside the same job, this
        # line could simply be removed; see the documentation of steps().
        values = json.loads(values.split('\t')[1])
        item_count, item_sum, ratings = values
        # Sort so that a given pair of movies always forms the same key,
        # whatever order the user's ratings arrive in.
        for item1, item2 in combinations(sorted(ratings), 2):
            yield (item1[0], item2[0]), (item1[1], item2[1])

    def calculate_similarity(self, pair_key, lines):
        '''
        The key is (Movie A, Movie B) and the values are (A rating, B rating).
        Each value comes from a single user, so all values that share a key
        come from users who rated both A and B. The ratings of all those
        users form the two vectors used to compute the Pearson coefficient
        of A and B.
        '''
        sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
        item_pair, co_ratings = pair_key, lines
        item_xname, item_yname = item_pair
        for item_x, item_y in co_ratings:
            sum_xx += item_x * item_x
            sum_yy += item_y * item_y
            sum_xy += item_x * item_y
            sum_y += item_y
            sum_x += item_x
            n += 1

        similarity = self.normalized_correlation(
            n, sum_xy, sum_x, sum_y, sum_xx, sum_yy)
        yield (item_xname, item_yname), (similarity, n)

    def steps(self):
        return [MRStep(mapper=self.pairwise_items,
                       reducer=self.calculate_similarity)]

    def normalized_correlation(self, n, sum_xy, sum_x, sum_y, sum_xx, sum_yy):
        numerator = (n * sum_xy - sum_x * sum_y)
        denominator = sqrt(n * sum_xx - sum_x * sum_x) * sqrt(n * sum_yy - sum_y * sum_y)
        # The denominator is zero when either movie's co-ratings are all
        # equal (always the case for n == 1); report no correlation then.
        if denominator == 0:
            return 0.0
        similarity = numerator / denominator
        return similarity


if __name__ == '__main__':
    Step2.run()
```

Run $ python step2.py result1.csv > result2.csv to get the result of step two.

Sample of the result set:

[Movie A, Movie B] [similarity, rating count]

 
```
["Star Trek VI: The Undiscovered Country (1991)", "Star Trek: Generations (1994)"]    [0.31762191045234545, 93]
["Star Trek VI: The Undiscovered Country (1991)", "Star Trek: The Motion Picture (1979)"]    [0.4632318663542742, 96]
["Star Trek VI: The Undiscovered Country (1991)", "Star Trek: The Wrath of Khan (1982)"]    [0.44969297939248015, 148]
["Star Trek VI: The Undiscovered Country (1991)", "Star Wars (1977)"]    [0.08625580124837125, 151]
["Star Trek VI: The Undiscovered Country (1991)", "Stargate (1994)"]    [0.30431878197511564, 94]
["Star Trek VI: The Undiscovered Country (1991)", "Stars Fell on Henrietta, The (1995)"]    [1.0, 2]
["Star Trek VI: The Undiscovered Country (1991)", "Starship Troopers (1997)"]    [0.14969005091372395, 59]
["Star Trek VI: The Undiscovered Country (1991)", "Steal Big, Steal Little (1995)"]    [0.74535599249993, 5]
["Star Trek VI: The Undiscovered Country (1991)", "Stealing Beauty (1996)"]    [-0.4879500364742666, 10]
["Star Trek VI: The Undiscovered Country (1991)", "Steel (1997)"]    [1.0, 2]
["Star Trek VI: The Undiscovered Country (1991)", "Stephen King's The Langoliers (1995)"]    [-0.11470786693528087, 16]
```

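The results clearly have some practical value. Note that Stars Fell on Henrietta, The (1995) scores 1.0, a perfect correlation, but only two people rated both movies, so not every result is trustworthy; the number of people who rated both movies also has to be taken into account.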
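One simple way to take the rating count into account is to ignore pairs with too few common raters before picking the most similar movies. The sketch below is only one possible approach: `top_similar` and the `min_raters` threshold are hypothetical, the value 20 is an arbitrary choice for illustration, and the rows are a few entries copied from the sample results, rebuilt as tab-separated JSON lines.

```python
import json

movie = "Star Trek VI: The Undiscovered Country (1991)"

# (other movie, similarity, rating count) copied from the sample results.
sample = [
    ("Star Trek: The Motion Picture (1979)", 0.4632318663542742, 96),
    ("Star Trek: The Wrath of Khan (1982)", 0.44969297939248015, 148),
    ("Star Wars (1977)", 0.08625580124837125, 151),
    ("Stars Fell on Henrietta, The (1995)", 1.0, 2),
    ("Steel (1997)", 1.0, 2),
]
rows = [json.dumps([movie, other]) + "\t" + json.dumps([sim, n])
        for other, sim, n in sample]


def top_similar(lines, movie, min_raters=20, k=2):
    """Return the k movies most similar to `movie`, skipping thin evidence."""
    candidates = []
    for line in lines:
        raw_pair, raw_value = line.split("\t", 1)
        (a, b), (similarity, n) = json.loads(raw_pair), json.loads(raw_value)
        # Skip pairs rated by too few common users, or not involving `movie`.
        if n < min_raters or movie not in (a, b):
            continue
        other = b if a == movie else a
        candidates.append((similarity, other))
    candidates.sort(reverse=True)
    return [title for _, title in candidates[:k]]


print(top_similar(rows, movie))
```

With the threshold in place, the two pairs that score 1.0 on only two common raters drop out, and the other Star Trek films come out on top.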

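Conclusion

The content of this article comes from the blog listed in the references; I have only organized it. If you have any questions, feel free to get in touch. It is worth pointing out that the movie recommendation described here is just one of many recommendation algorithms: it judges similarity between items. You can also judge similarity between users. Similar users tend to like the same movies, which works somewhat better in practice and is easier to extend using social relationships.

References: http://aimotion.blogspot.com.br/2012/08/introduction-to-recommendations-with.html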