This article is based on http://michaelnielsen.org/blog/page/19/.
Let's start with the classic MapReduce example: word counting. The input to a MapReduce job is a set of (input_key, input_value) pairs, and such a set can be represented with a Python dictionary. In the word-count example, input_key is a filename and input_value is the contents of that file.
filenames = ['a.txt', 'b.txt', 'c.txt']
i = {}
for filename in filenames:
    f = open(filename)
    i[filename] = f.read()
    f.close()
The Python dictionary i now holds all the input to the MapReduce job. The files a.txt, b.txt and c.txt contain:
text\a.txt:
The quick brown fox jumped over the lazy grey dogs.
text\b.txt:
That's one small step for a man, one giant leap for mankind.
text\c.txt:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.
A MapReduce job runs in two phases: a map phase and a reduce phase. The map phase produces intermediate keys and values, which are then processed by the reduce phase. In the map phase, a mapper function mapper(input_key, input_value) is applied to each (input_key, input_value) pair in the job's input dictionary i, returning a list of intermediate keys and values. For example, mapper("a.txt", i["a.txt"]) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1),
('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
The mapper function is defined as follows:

def mapper(input_key,input_value):
    return [(word,1) for word in
        remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans("",""), string.punctuation)
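Note that the two-argument form of str.translate used above (with string.maketrans("", "") plus a deletechars argument) is Python 2 only. A sketch of the same mapper for Python 3, where str.maketrans takes a third argument listing the characters to delete:

```python
import string

def remove_punctuation(s):
    # Python 3: the third argument to str.maketrans lists characters to delete.
    return s.translate(str.maketrans("", "", string.punctuation))

def mapper(input_key, input_value):
    # Emit an intermediate (word, 1) pair for every word in the file body.
    return [(word, 1) for word in
            remove_punctuation(input_value.lower()).split()]

pairs = mapper("a.txt", "The quick brown fox jumped over the lazy grey dogs.")
# pairs == [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1),
#           ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
```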
With this mapper defined, the output of the map phase is the concatenation of the lists returned by applying mapper to every entry of the input dictionary i, i.e. mapper("a.txt", i["a.txt"]), mapper("b.txt", i["b.txt"]) and mapper("c.txt", i["c.txt"]):
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1),
('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1),
('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1),
('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1),
('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1),
('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1),
('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1),
('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1),
('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1),
('mankind', 1)]
Next comes the reduce phase. Before it runs, MapReduce performs a preprocessing step: all the values that share the same key in the map phase's list of intermediate keys and values are collected together, producing an intermediate dictionary:
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1],
'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1],
'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1],
'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1],
'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1],
'that': [1], 'little': [1], 'small': [1], 'step': [1],
'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1],
'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1],
'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
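This grouping step is straightforward to implement. One common sketch (illustrative, not the implementation used later in this article) builds the intermediate dictionary with collections.defaultdict:

```python
from collections import defaultdict

# A small excerpt of map-phase output, for illustration.
intermediate = [('the', 1), ('lamb', 1), ('the', 1), ('the', 1), ('lamb', 1)]

groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)  # collect every value emitted under this key

# dict(groups) == {'the': [1, 1, 1], 'lamb': [1, 1]}
```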
In the reduce phase, a reducer function reducer(intermediate_key, intermediate_value_list) is applied to each entry of the intermediate dictionary. In the word-count example, the reducer simply sums the values in the intermediate_value_list for each intermediate_key:
def reducer(intermediate_key,intermediate_value_list):
    return (intermediate_key,sum(intermediate_value_list))
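Applied to a single entry of the intermediate dictionary, for example:

```python
def reducer(intermediate_key, intermediate_value_list):
    # Collapse the list of counts for one word into a single total.
    return (intermediate_key, sum(intermediate_value_list))

result = reducer('the', [1, 1, 1])
# result == ('the', 3)
```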
Running the reduce phase over the intermediate dictionary yields the final output:
[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1),
('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2),
('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1),
('white', 1), ('was', 2), ('mary', 2), ('brown', 1),
('lazy', 1), ('sure', 1), ('that', 1), ('little', 1),
('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1),
('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1),
('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
The complete program:
#word_count.py
import string
import map_reduce
def mapper(input_key,input_value):
    return [(word,1) for word in
        remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans("",""), string.punctuation)

def reducer(intermediate_key,intermediate_value_list):
    return (intermediate_key,sum(intermediate_value_list))

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
    f = open(filename)
    i[filename] = f.read()
    f.close()

print map_reduce.map_reduce(i,mapper,reducer)
The map_reduce module:
# map_reduce.py
import itertools
def map_reduce(i,mapper,reducer):
    intermediate = []
    for (key,value) in i.items():
        intermediate.extend(mapper(key,value))
    groups = {}
    for key, group in itertools.groupby(sorted(intermediate),
                                        lambda x: x[0]):
        groups[key] = list([y for x, y in group])
    return [reducer(intermediate_key,groups[intermediate_key])
            for intermediate_key in groups]
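The code above is Python 2 (string.maketrans with deletechars, the print statement). For reference, here is a self-contained Python 3 sketch of the same pipeline, run on a small in-memory input dictionary instead of files (the toy input is illustrative):

```python
import itertools
import string

def map_reduce(i, mapper, reducer):
    # Map phase: apply the mapper to every (key, value) input pair.
    intermediate = []
    for key, value in i.items():
        intermediate.extend(mapper(key, value))
    # Grouping step: collect values by key (groupby requires sorted input).
    groups = {}
    for key, group in itertools.groupby(sorted(intermediate), lambda x: x[0]):
        groups[key] = [y for _, y in group]
    # Reduce phase: apply the reducer to each group.
    return [reducer(k, groups[k]) for k in groups]

def mapper(input_key, input_value):
    cleaned = input_value.lower().translate(
        str.maketrans("", "", string.punctuation))
    return [(word, 1) for word in cleaned.split()]

def reducer(key, values):
    return (key, sum(values))

i = {"a.txt": "the lamb was sure, the lamb went"}
counts = sorted(map_reduce(i, mapper, reducer))
# counts == [('lamb', 2), ('sure', 1), ('the', 2), ('was', 1), ('went', 1)]
```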