MapReduce
MapReduce is a computational model, but one that belongs to the world of parallel computing: it describes how to spread a computation across many machines.
Consider a simple example: word count.
from collections import Counter
import re

documents = ["data science", "big data", "science fiction"]

def tokenize(message):
    """Lowercase the message and return the set of distinct words in it."""
    message = message.lower()
    all_words = re.findall('[a-z0-9]+', message)
    return set(all_words)

def word_count_old(documents):
    """Word count, not using MapReduce."""
    return Counter(word
                   for document in documents
                   for word in tokenize(document))

print(word_count_old(documents))
This is the simplest way to count words, but if there are billions of documents this approach becomes painfully slow, and the data may well be too large to fit on a single machine.
First, the code:
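Before getting to MapReduce proper, here is a minimal sketch (not from the original text) of the underlying idea: split the documents into chunks, count each chunk independently as if on its own machine, and merge the partial counts. `Counter` supports `+`, so merging is trivial; the helper names here are illustrative, not part of any library.

```python
from collections import Counter
import re

def tokenize(message):
    """Lowercase the message and return the set of distinct words in it."""
    message = message.lower()
    return set(re.findall('[a-z0-9]+', message))

def count_chunk(chunk):
    """Count words in one chunk of documents (could run on its own machine)."""
    return Counter(word for document in chunk for word in tokenize(document))

documents = ["data science", "big data", "science fiction"]

# Pretend each chunk lives on a different machine.
chunks = [documents[:2], documents[2:]]
partial_counts = [count_chunk(chunk) for chunk in chunks]

# Merging partial results is just Counter addition.
total = sum(partial_counts, Counter())
print(total)
```

This already captures the map/reduce flavor: independent local work, then a cheap combine step. MapReduce formalizes it with explicit (key, value) pairs.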
from collections import defaultdict

def wc_mapper(document):
    """For each word in the document, emit (word, 1)."""
    for word in tokenize(document):
        yield (word, 1)

def wc_reducer(word, counts):
    """Sum up the counts for a word."""
    yield (word, sum(counts))

def word_count(documents):
    """Count the words in the input documents using MapReduce."""
    collector = defaultdict(list)   # group the emitted counts by word
    for document in documents:
        for word, count in wc_mapper(document):
            collector[word].append(count)
    print(collector)
    return [output
            for word, counts in collector.items()
            for output in wc_reducer(word, counts)]

print(word_count(documents))
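To make the data flow concrete, here is a hand trace (using the same function names as above, on a smaller two-document corpus) of what the mapper emits, what the collector looks like after grouping, and what the reducer returns; note that the exact iteration order of a set is not guaranteed, so the prints are sorted.

```python
from collections import defaultdict
import re

def tokenize(message):
    message = message.lower()
    return set(re.findall('[a-z0-9]+', message))

def wc_mapper(document):
    for word in tokenize(document):
        yield (word, 1)

def wc_reducer(word, counts):
    yield (word, sum(counts))

documents = ["data science", "big data"]

# Map and group: "data" appears in both documents, so it collects two 1s.
collector = defaultdict(list)
for document in documents:
    for word, count in wc_mapper(document):
        collector[word].append(count)
print(sorted(collector.items()))
# [('big', [1]), ('data', [1, 1]), ('science', [1])]

# Reduce: sum each word's list of 1s.
results = [out
           for word, counts in collector.items()
           for out in wc_reducer(word, counts)]
print(sorted(results))
# [('big', 1), ('data', 2), ('science', 1)]
```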
This part can be hard to follow, so let's walk through it step by step, using documents = ["data science", "big data"