This article is based on http://michaelnielsen.org/blog/page/19/.
Let's start with the classic MapReduce example: word counting. The input to a MapReduce job is a set of (input_key, input_value) pairs, and such a set can be represented with a Python dictionary. In the word-count example, input_key is a filename and input_value is the contents of that file.
filenames = ['a.txt', 'b.txt', 'c.txt']
i = {}
for filename in filenames:
    f = open(filename)
    i[filename] = f.read()
    f.close()
The Python dictionary i now holds all the input to the MapReduce job. The files a.txt, b.txt and c.txt contain:
text\a.txt:
The quick brown fox jumped over the lazy grey dogs.
text\b.txt:
That's one small step for a man, one giant leap for mankind.
text\c.txt:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.
A MapReduce job runs in two phases: a map phase and a reduce phase. The map phase produces intermediate keys and values, which are then processed by the reduce phase. In the map phase, a mapper function mapper(input_key, input_value) is applied to each (input_key, input_value) pair in the job's input dictionary i, returning a list of intermediate keys and values. For example, mapper("a.txt", i["a.txt"]) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1),
('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
The mapper function is defined as follows:

def mapper(input_key,input_value):
    return [(word,1) for word in
        remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans("",""), string.punctuation)
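Note that the two-argument form of str.translate used above (with string.maketrans("", "") plus a deletechars argument) is Python 2 only. A sketch of the same mapper for Python 3, where str.maketrans takes a third argument listing the characters to delete:

```python
import string

def remove_punctuation(s):
    # Python 3: the third argument to str.maketrans lists characters to delete.
    return s.translate(str.maketrans("", "", string.punctuation))

def mapper(input_key, input_value):
    # Emit an intermediate (word, 1) pair for every word in the file body.
    return [(word, 1) for word in
            remove_punctuation(input_value.lower()).split()]

pairs = mapper("a.txt", "The quick brown fox jumped over the lazy grey dogs.")
# pairs == [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1),
#           ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
```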
With this mapper defined, the output of the map phase is the concatenation of the lists returned by applying mapper to every entry of the input dictionary i, i.e. mapper("a.txt", i["a.txt"]), mapper("b.txt", i["b.txt"]) and mapper("c.txt", i["c.txt"]):
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1),
('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1),
('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1),
('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1),
('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1),
('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1),
('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1),
('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1),
('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1),
('mankind', 1)]
Next comes the reduce phase. Before it runs, MapReduce performs a preprocessing step: all the values that share the same key in the map phase's list of intermediate keys and values are collected together, producing an intermediate dictionary:
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1],
'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1],
'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1],
'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1],
'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1],
'that': [1], 'little': [1], 'small': [1], 'step': [1],
'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1],
'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1],
'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
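This grouping step is straightforward to implement. One common sketch (illustrative, not the implementation used later in this article) builds the intermediate dictionary with collections.defaultdict:

```python
from collections import defaultdict

# A small excerpt of map-phase output, for illustration.
intermediate = [('the', 1), ('lamb', 1), ('the', 1), ('the', 1), ('lamb', 1)]

groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)  # collect every value emitted under this key

# dict(groups) == {'the': [1, 1, 1], 'lamb': [1, 1]}
```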
In the reduce phase, a reducer function reducer(intermediate_key, intermediate_value_list) is applied to each entry of the intermediate dictionary. In the word-count example, the reducer simply sums the values in the intermediate_value_list for each intermediate_key:
def reducer(intermediate_key,intermediate_value_list):
    return (intermediate_key,sum(intermediate_value_list))
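Applied to a single entry of the intermediate dictionary, for example:

```python
def reducer(intermediate_key, intermediate_value_list):
    # Collapse the list of counts for one word into a single total.
    return (intermediate_key, sum(intermediate_value_list))

result = reducer('the', [1, 1, 1])
# result == ('the', 3)
```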
Running the reduce phase over the intermediate dictionary yields the final output:
[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1),
('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2),
('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1),
('white', 1), ('was', 2), ('mary', 2), ('brown', 1),
('lazy', 1), ('sure', 1), ('that', 1), ('little', 1),
('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1),
('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1),
('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
The complete program:
#word_count.py
import string
import map_reduce
def mapper(input_key,input_value):
    return [(word,1) for word in
        remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans("",""), string.punctuation)

def reducer(intermediate_key,intermediate_value_list):
    return (intermediate_key,sum(intermediate_value_list))

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
    f = open(filename)
    i[filename] = f.read()
    f.close()

print map_reduce.map_reduce(i,mapper,reducer)
The map_reduce module:
# map_reduce.py
import itertools
def map_reduce(i,mapper,reducer):
    intermediate = []
    for (key,value) in i.items():
        intermediate.extend(mapper(key,value))
    groups = {}
    for key, group in itertools.groupby(sorted(intermediate),
                                        lambda x: x[0]):
        groups[key] = list([y for x, y in group])
    return [reducer(intermediate_key,groups[intermediate_key])
            for intermediate_key in groups]
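The code above is Python 2 (string.maketrans with deletechars, the print statement). For reference, here is a self-contained Python 3 sketch of the same pipeline, run on a small in-memory input dictionary instead of files (the toy input is illustrative):

```python
import itertools
import string

def map_reduce(i, mapper, reducer):
    # Map phase: apply the mapper to every (key, value) input pair.
    intermediate = []
    for key, value in i.items():
        intermediate.extend(mapper(key, value))
    # Grouping step: collect values by key (groupby requires sorted input).
    groups = {}
    for key, group in itertools.groupby(sorted(intermediate), lambda x: x[0]):
        groups[key] = [y for _, y in group]
    # Reduce phase: apply the reducer to each group.
    return [reducer(k, groups[k]) for k in groups]

def mapper(input_key, input_value):
    cleaned = input_value.lower().translate(
        str.maketrans("", "", string.punctuation))
    return [(word, 1) for word in cleaned.split()]

def reducer(key, values):
    return (key, sum(values))

i = {"a.txt": "the lamb was sure, the lamb went"}
counts = sorted(map_reduce(i, mapper, reducer))
# counts == [('lamb', 2), ('sure', 1), ('the', 2), ('was', 1), ('went', 1)]
```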