I took "Introduction to Big Data Science" this semester and didn't pay much attention in class. In the end, while teaching myself some Hadoop to finish the course project, I found it quite interesting.


A MapReduce job runs in three main phases: Map, Shuffle, and Reduce (the following is quoted from Wikipedia):

• “Map” step: Each worker node applies the “map()” function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.
• “Shuffle” step: Worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node.
• “Reduce” step: Worker nodes now process each group of output data, per key, in parallel.
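The three steps above can be sketched in plain Python on a toy word-count problem (a minimal single-process sketch; the input strings are made up for illustration):

```python
from collections import defaultdict

# "Map" step: each input chunk is turned into (key, value) pairs.
chunks = ["big data hadoop", "hadoop map reduce", "big hadoop"]
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# "Shuffle" step: redistribute the pairs so that all values
# belonging to one key end up together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# "Reduce" step: process each key's group independently.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 2, 'data': 1, 'hadoop': 3, 'map': 1, 'reduce': 1}
```

In a real cluster each step runs on many worker nodes at once; the single-process version only shows the data flow.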

mapper.py:

```python
import sys
import json

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue

    # parse the line as a JSON array: [key, value],
    # i.e. [document id, document text]
    record = json.loads(line)
    key = record[0]
    value = record[1]

    # split the document text into words
    words = value.split()

    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited: word \t document id
        print('%s\t%s' % (word, key))
```
reducer.py:

```python
import sys

# maps words to the set of documents they appear in
word2set = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue

    # parse the input we got from mapper.py
    word, doc = line.split('\t', 1)

    # collect the documents containing each word
    if word not in word2set:
        word2set[word] = set()
    word2set[word].add(doc)

# write the results to STDOUT (standard output)
for word in word2set:
    print('%-16s%s' % (word, word2set[word]))
```
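In Hadoop Streaming the framework sorts the mapper's output by key before handing it to the reducer, so running the two scripts locally corresponds to a `mapper | sort | reducer` pipeline. A minimal in-memory sketch of that pipeline for the inverted index above (the sample documents are made up for illustration):

```python
# Sample [doc_id, text] records, standing in for the JSON lines
# mapper.py reads from STDIN.
docs = [["d1", "big data hadoop"], ["d2", "hadoop streaming"]]

# Mapper: emit "word \t doc_id" lines.
lines = ["%s\t%s" % (w, doc_id) for doc_id, text in docs for w in text.split()]

# Shuffle: Hadoop Streaming sorts mapper output by key.
lines.sort()

# Reducer: collect the set of documents for each word.
word2set = {}
for line in lines:
    word, doc = line.split('\t', 1)
    word2set.setdefault(word, set()).add(doc)

print(word2set)
```

Because the lines arrive sorted, a streaming reducer can also process one key at a time without holding the whole dictionary in memory.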

At its core, MapReduce is divide and conquer: a huge dataset is split into chunks that can be processed in parallel. The Map phase turns each chunk into key-value pairs; Sort & Combine groups those pairs into equivalence classes by key; since the classes are independent of one another, the Reduce phase processes each class separately, producing one result per class, and these per-class results are finally merged into the overall output. Many problems that are simple on their own become challenging to fit into the MapReduce model, but once a problem is successfully cast into it, the power of parallel computation can be brought to bear on data of enormous scale.
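The independence of the key classes is what lets Hadoop run many reducers at once: each key is assigned to exactly one reducer (by default via a hash partitioner), so no class is ever split across two reducers. A sketch of the idea (`num_reducers` and the keys are arbitrary values for illustration):

```python
num_reducers = 3
keys = ["apple", "banana", "cherry", "apple"]

# Hash-partitioner idea: reducer index = hash(key) mod num_reducers.
partitions = {}
for k in keys:
    partitions.setdefault(hash(k) % num_reducers, set()).add(k)

# Every occurrence of a key lands in the same partition, so each
# reducer sees complete, disjoint groups of keys.
```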

