hadoop jar hadoop-streaming-2.6.4.jar \
-D mapreduce.job.name='test' \
-files /local/path/to/mapper.py,/local/path/to/reducer.py \
-input /test/data/* \
-output /test/output/ \
-mapper 'python /local/path/to/mapper.py' \
-reducer 'python /local/path/to/reducer.py'
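1. The Python files must be distributed to every node.
2. -mapper and -reducer must invoke the script through python explicitly; otherwise the job fails with:
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory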
mapper.py
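#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys

# Emit one "word<TAB>1" line per token read from stdin.
for line in sys.stdin:
    line = line.strip()
    # Split on commas, periods, question marks, whitespace and double quotes.
    words = re.split(r'[,.?\s"]', line)
    for word in words:
        # str.strip() takes a set of characters, not a regex, so list only
        # the punctuation to trim; empty strings from adjacent delimiters
        # are skipped below.
        word = word.strip(',.?|')
        if word:
            print("{0}\t{1}".format(word, 1))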
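The splitting logic can be checked locally before a job is submitted. The following is a minimal sketch, not part of the original scripts: tokenize and map_line are hypothetical helper names, and the delimiter set is assumed to be the same one passed to re.split in mapper.py.

```python
import re

# Assumed to match the character class used by re.split in mapper.py.
_DELIMS = re.compile(r'[,.?\s"]')


def tokenize(line):
    """Split a line on commas, periods, question marks, whitespace and double quotes."""
    # Adjacent delimiters produce empty strings, which are filtered out.
    return [w for w in _DELIMS.split(line.strip()) if w]


def map_line(line):
    """Return the (word, 1) pairs a word-count mapper would emit for one line."""
    return [(w, 1) for w in tokenize(line)]


if __name__ == "__main__":
    # Print in the same tab-separated form that Hadoop Streaming expects.
    for word, one in map_line('Hello, world. Is this "Hadoop"?\n'):
        print("{0}\t{1}".format(word, one))
```

Keeping the tokenizer in a function makes it easy to try edge cases such as trailing punctuation or quoted words without running a cluster job.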
reducer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so all lines for one word are adjacent.
for line in sys.stdin:
    # Each line has the form "word<TAB>count".
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        # The key changed: emit the total for the previous word.
        if current_word:
            print("{0}\t{1}".format(current_word, current_count))
        current_word = word
        current_count = count

# Emit the total for the last word.
if word:
    print("{0}\t{1}".format(current_word, current_count))
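The reducer only compares each word with the previous one because Hadoop Streaming sorts the mapper output by key before passing it to the reducer. The sketch below simulates that map, sort, reduce flow locally, roughly what `cat data | mapper | sort | reducer` does on one machine. It is an illustration under assumptions, not the original scripts: reduce_sorted and simulate are hypothetical names, the mapper is simplified to whitespace splitting, and itertools.groupby replaces the manual current_word bookkeeping.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted(pairs):
    """Sum the counts of each run of identical keys; input must be sorted by key."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def simulate(lines):
    """Local stand-in for map -> sort -> reduce over a list of text lines."""
    # Simplified mapper: whitespace-only splitting (an assumption for brevity).
    mapped = [(w, 1) for line in lines for w in line.split()]
    # Stands in for the sort by key that Hadoop performs between the phases.
    mapped.sort(key=itemgetter(0))
    return list(reduce_sorted(mapped))


if __name__ == "__main__":
    for word, total in simulate(["a b a", "b a c"]):
        print("{0}\t{1}".format(word, total))
```

If the sort step is skipped, equal words that are not adjacent are reported as separate groups, which is the same failure the hand-written reducer would show on unsorted input.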
Reference, official documentation: https://hadoop.apache.org/docs/r2.7.7/hadoop-streaming/HadoopStreaming.html