Map function (mapper.py)
#!/usr/bin/env python
import sys

# Read lines from standard input and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
Reduce function (reducer.py)
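#!/usr/bin/env python
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Each input line has the form "word<TAB>count"; only the word is needed here.
    word = line.split("\t", 1)[0]
    if current_word == word:
        # Same word as on the previous line: increment the count.
        current_count += 1
    else:
        # A different word: print the finished word first (if there is one),
        # then start counting the new word from one.
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_word = word
        current_count = 1

# After the loop ends, print the last word (nothing is printed for empty input).
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))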
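The grouping done by the reducer above can also be sketched with itertools.groupby from the standard library. This is an alternative formulation, not the version used in this post; the function name reduce_counts and the sample input are made up for illustration, and the sketch assumes every non-empty line has the form "word<TAB>count". Unlike the reducer above, it sums the count field instead of adding one per line.

```python
import itertools


def reduce_counts(lines):
    """Yield (word, total) for sorted lines of the form 'word<TAB>count'."""
    # Split each non-empty line into [word, count].
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent equal keys, so the input must be sorted by word.
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == "__main__":
    # Hypothetical sample input, already sorted by word.
    sample = ["hello\t1\n", "hello\t1\n", "map\t1\n"]
    for word, total in reduce_counts(sample):
        print("%s\t%s" % (word, total))
```

To use it as a streaming reducer, pass sys.stdin to reduce_counts instead of the sample list.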
Test locally (make both scripts executable first, e.g. chmod +x mapper.py reducer.py):
echo "hello word hello Hadoop map reduce" | ./mapper.py | sort -k1,1 | ./reducer.py
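The Python reducer can only count words that arrive in sorted order, which is why the local test pipes the mapper output through sort. On Hadoop, the framework sorts the mapper output by key before it reaches the reducer.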
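Why the sort step matters can be seen in a small in-process simulation. This is only a sketch: map_words and reduce_adjacent are hypothetical helpers that mimic the two scripts, and Python's sorted() stands in for sort -k1,1 (its ordering may differ from the shell's locale ordering, but equal words still end up adjacent).

```python
def map_words(text):
    """Emit (word, 1) pairs, like the mapper."""
    return [(word, 1) for word in text.split()]


def reduce_adjacent(pairs):
    """Sum the counts of adjacent equal words, like the reducer."""
    result = []
    for word, count in pairs:
        if result and result[-1][0] == word:
            # Same word as the previous pair: add to its running total.
            result[-1] = (word, result[-1][1] + count)
        else:
            # A different word: start a new entry.
            result.append((word, count))
    return result


text = "hello word hello Hadoop map reduce"
pairs = map_words(text)

# Without sorting, the two "hello" pairs are not adjacent, so "hello" is reported twice.
print(reduce_adjacent(pairs))
# With sorting, equal words are adjacent and each word is reported once.
print(reduce_adjacent(sorted(pairs)))
```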
Run on Hadoop:
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file test/code/mapper.py -mapper test/code/mapper.py \
-file test/code/reducer.py -reducer test/code/reducer.py \
-input /user/rte/hdfs_in/* -output /user/rte/hdfs_out