If you see the following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
insert #!/usr/bin/env python as the first line of the Python script to fix it.
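A quick way to catch a missing interpreter line before submitting a job is to check the first two bytes of each script. The helper below is only a sketch; the name `has_shebang` is hypothetical and not part of Hadoop.

```python
def has_shebang(path):
    """Return True if the file at `path` starts with a '#!' interpreter line."""
    with open(path, "rb") as f:
        # A shebang only counts when '#!' are the very first two bytes of the file.
        return f.read(2) == b"#!"
```

For example, has_shebang("/home/map.py") should return True once the line has been added at the very top of the script.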
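mapper.py:
#!/usr/bin/env python
import sys

# Count how often each word occurs in this mapper's input.
dic = {}
for line in sys.stdin:
    for key in line.strip().split():
        if key in dic:  # dict.has_key() does not exist in Python 3
            dic[key] += 1
        else:
            dic[key] = 1

# Emit one "word<TAB>count" record per distinct word.
# print(...) with a single argument works in both Python 2 and 3.
for key, value in dic.items():
    print("%s\t%d" % (key, value))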
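reducer.py:
#!/usr/bin/env python
import sys

# Sum the counts emitted by the mappers for each word.
wordcount = {}
for line in sys.stdin:
    line = line.strip()
    if not line:  # skip blank lines instead of failing on split()
        continue
    word, count = line.split("\t", 1)
    count = int(count)
    wordcount[word] = wordcount.get(word, 0) + count

# Emit the final "word<TAB>total" records.
for word, count in wordcount.items():
    print("%s\t%d" % (word, count))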
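Before submitting the job, the map and reduce logic can be checked locally. The sketch below re-implements the two scripts as plain functions and feeds the mapper output into the reducer, roughly as Hadoop streaming would. The function names and the sample input are made up for illustration.

```python
from collections import defaultdict


def map_lines(lines):
    """Mimic mapper.py: return 'word<TAB>count' records for the given lines."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.strip().split():
            counts[word] += 1
    return ["%s\t%d" % (w, c) for w, c in counts.items()]


def reduce_records(records):
    """Mimic reducer.py: sum the counts per word."""
    totals = {}
    for record in records:
        word, count = record.strip().split("\t", 1)
        totals[word] = totals.get(word, 0) + int(count)
    return totals


# Two hypothetical input splits handled by separate mappers,
# then sorted and merged by a single reducer.
split_a = ["hello world", "hello hadoop"]
split_b = ["hadoop streaming"]
records = sorted(map_lines(split_a) + map_lines(split_b))
print(reduce_records(records))
```

Because the reducer keeps its totals in a dictionary, it does not depend on the records arriving sorted; the sort is there only to mirror what the framework does between the two phases.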
Hadoop command:
hadoop jar /hadoop/hadoop-streaming-1.1.2.jar -input * -output * -file /home/map.py -mapper map.py -file /home/red.py -reducer red.py
Note: hadoop-streaming-1.1.2.jar is not in the Hadoop root directory. Look for it under /hadoop/contrib/streaming and adjust the jar path in the command accordingly.
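If the jar location is unclear, a small script can search the contrib/streaming directory for it. This is a sketch only: `find_streaming_jar` is a hypothetical helper, and it assumes the directory layout described in the note above.

```python
import glob
import os


def find_streaming_jar(hadoop_home):
    """Return the streaming jars found under <hadoop_home>/contrib/streaming."""
    pattern = os.path.join(
        hadoop_home, "contrib", "streaming", "hadoop-streaming-*.jar"
    )
    # An empty list means no matching jar exists in that directory.
    return sorted(glob.glob(pattern))
```

Calling find_streaming_jar("/hadoop") returns a list of matching paths, any of which can be passed to hadoop jar.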