Chapter 10: Concurrency with Processes, Threads, and Coroutines - multiprocessing: Manage Processes Like Threads - Implementing MapReduce

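10.4.18 Implementing MapReduce
The Pool class can be used to create a simple single-server MapReduce implementation. Although it does not give the full benefits of distributed processing, it does illustrate how easily some problems can be broken down into distributable units of work. In a MapReduce-based system, input data is broken down into chunks for processing by different worker instances. Each chunk of input data is first mapped to an intermediate state using a simple transformation. The intermediate data is then collected together and partitioned based on a key value so that all of the related values are together. Finally, the partitioned data is reduced to a result set.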
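
The three phases can be sketched sequentially, without any worker processes, to make the data flow concrete. The sample records and the helper names (map_chunk, partition, reduce_max) are hypothetical and chosen only for illustration; this is a minimal sketch of the idea, not the Pool-based implementation.

```python
import collections

# Hypothetical input, already split into chunks of "city,temperature" records.
CHUNKS = [
    ['oslo,3', 'cairo,31'],
    ['oslo,7', 'lima,19', 'cairo,28'],
]


def map_chunk(chunk):
    """Map phase: turn one chunk into (key, value) pairs."""
    pairs = []
    for record in chunk:
        city, temp = record.split(',')
        pairs.append((city, int(temp)))
    return pairs


def partition(pairs):
    """Partition phase: collect all values that share a key."""
    groups = collections.defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_max(item):
    """Reduce phase: collapse one key's values to a single result."""
    key, values = item
    return (key, max(values))


mapped = [pair for chunk in CHUNKS for pair in map_chunk(chunk)]
grouped = partition(mapped)
result = sorted(reduce_max(item) for item in grouped.items())
print(result)
```

Because map_chunk() and reduce_max() each look at only one chunk or one key at a time, both phases could in principle be handed to separate workers. Only the partition step needs to see all of the intermediate data.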

# multiprocessing_mapreduce.py
import collections
import itertools
import multiprocessing


class SimpleMapReduce:

    def __init__(self, map_func, reduce_func, num_workers=None):
        """
        map_func

        Function to map inputs to intermediate data. Takes as
        argument one input value and returns a sequence of
        (key, value) tuples to be reduced.

        reduce_func

        Function to reduce partitioned version of intermediate
        data to final output. Takes as argument a tuple holding
        a key as produced by map_func and the sequence of
        values associated with that key.

        num_workers

        The number of workers to create in the pool. Defaults
        to the number of CPUs available on the current host.
        """
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)

    def partition(self, mapped_values):
        """Organize the mapped values by their key.
        Returns an unsorted sequence of tuples with a key
        and a sequence of values.
        """
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()

    def __call__(self, inputs, chunksize=1):
        """Process the inputs through the map and reduce functions
        given.

        inputs
        An iterable containing the input data to be processed.

        chunksize=1
        The portion of the input data to hand to each worker.
        This can be used to tune performance during the mapping
        phase.
        """
        map_responses = self.pool.map(
            self.map_func,
            inputs,
            chunksize=chunksize,
            )
        partitioned_data = self.partition(
            itertools.chain(*map_responses)
            )
        reduced_values = self.pool.map(
            self.reduce_func,
            partitioned_data,
            )
        return reduced_values
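
One detail of __call__() is worth spelling out. pool.map() returns one result per input, so when each call to map_func returns a list of (key, value) pairs, the responses form a list of lists, and itertools.chain(*map_responses) flattens them before partitioning. The sketch below reproduces just that step with no worker processes; the map_responses values are made up to stand in for real results.

```python
import collections
import itertools

# Made-up stand-in for what pool.map() would return: one list of
# (key, value) pairs per input item.
map_responses = [
    [('spam', 1), ('eggs', 1)],
    [('spam', 1)],
    [],  # An input that produced no pairs is harmless.
]

# Flatten the list of lists into one stream of pairs.
# itertools.chain.from_iterable(map_responses) is equivalent.
flat = list(itertools.chain(*map_responses))

# Group the values by key, as partition() does.
partitioned = collections.defaultdict(list)
for key, value in flat:
    partitioned[key].append(value)

print(flat)
print(dict(partitioned))
```

This is also why a map function that returned a single (key, value) tuple instead of a sequence of them would break the partition step: the flattening would split the tuple into its two elements.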

The following example script uses SimpleMapReduce to count the "words" in the reStructuredText source for this article, ignoring some of the markup.

import multiprocessing
import string

from multiprocessing_mapreduce import SimpleMapReduce


def file_to_words(filename):
    """Read a file and return a sequence of
    (word, occurrences) values.
    """
    STOP_WORDS = {
        'a', 'an', 'and', 'are', 'as', 'be', 'by', 'for', 'if',
        'in', 'is', 'it', 'of', 'or', 'py', 'rst', 'that', 'the',
        'to', 'with',
    }
    # Map every punctuation character to a space.
    TR = str.maketrans({
        p: ' '
        for p in string.punctuation
    })

    print('{} reading {}'.format(
        multiprocessing.current_process().name, filename))
    output = []

    with open(filename, 'rt') as f:
        for line in f:
            # Skip comment lines.
            if line.lstrip().startswith('..'):
                continue
            line = line.translate(TR)  # Strip punctuation.
            for word in line.split():
                word = word.lower()
                if word.isalpha() and word not in STOP_WORDS:
                    output.append((word, 1))
    return output

def count_words(item):
    """Convert the partitioned data for a word to a
    tuple containing the word and the number of occurrences.
    """
    word, occurrences = item
    return (word, sum(occurrences))


if __name__ == '__main__':
    import operator
    import glob

    input_files = glob.glob('*.rst')

    mapper = SimpleMapReduce(file_to_words, count_words)
    word_counts = mapper(input_files)
    word_counts.sort(key=operator.itemgetter(1))
    word_counts.reverse()

    print('\nTOP 20 WORDS BY FREQUENCY\n')
    top20 = word_counts[:20]
    # default=0 avoids a ValueError when no input files were found.
    longest = max((len(word) for word, count in top20), default=0)
    for word, count in top20:
        print('{word:<{len}}:{count:5}'.format(
            len=longest + 1,
            word=word,
            count=count,
        ))

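The file_to_words() function converts each input file to a sequence of tuples containing the word and the number 1 (representing a single occurrence). partition() divides up the data using the word as the key, so the resulting structure consists of a key and a sequence of 1 values, one for each occurrence of the word. During the reduction phase, count_words() converts the partitioned data to a set of tuples, each containing a word and the count for that word.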
Output: [screenshot not available]
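
To make the data flow described above concrete, the same three steps can be traced on two in-memory lines of text instead of files. This is a sequential sketch: line_to_words is a hypothetical stand-in for file_to_words() with a shortened stop-word list, and the sample lines are invented.

```python
import collections
import string

STOP_WORDS = {'a', 'the', 'is', 'of'}  # Shortened list for the sketch.
TR = str.maketrans({p: ' ' for p in string.punctuation})


def line_to_words(line):
    """Return a list of (word, 1) pairs for one line of text."""
    pairs = []
    for word in line.translate(TR).split():
        word = word.lower()
        if word.isalpha() and word not in STOP_WORDS:
            pairs.append((word, 1))
    return pairs


def count_words(item):
    """Sum the occurrences collected for one word."""
    word, occurrences = item
    return (word, sum(occurrences))


lines = ['The pool maps the inputs.', 'A pool is a pool of workers!']

# Map: each line becomes a list of (word, 1) pairs.
mapped = [pair for line in lines for pair in line_to_words(line)]

# Partition: each word is paired with one 1 per occurrence.
partitioned = collections.defaultdict(list)
for word, one in mapped:
    partitioned[word].append(one)

# Reduce: each list of 1s collapses to a count.
counts = dict(count_words(item) for item in partitioned.items())
print(dict(partitioned))
print(counts)
```

After the partition step, 'pool' is associated with [1, 1, 1], and the reduce step turns that list into the count 3.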
