1. Create a new file mapper_python.py and edit it as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

# Read lines from standard input, split each line into words,
# and emit one "word<TAB>1" pair per word for the reducer.
for line in sys.stdin:
    line = line.strip()
    words = line.split(' ')
    for word in words:
        print("%s\t%s" % (word, 1))
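As a quick sanity check, the mapper can be run locally by piping a sample line into it (the sample text below is only an illustration, not part of the real input):

echo "hello world hello" | python mapper_python.py
# prints:
# hello   1
# world   1
# hello   1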
2. Create a new file reducer_python.py and edit it as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

current_word = None
current_count = 0
word = None

# Hadoop Streaming sorts the mapper output by key, so all counts
# for the same word arrive on consecutive lines.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # A new word has started; emit the total for the previous one.
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the count for the last word in the input.
if current_word == word:
    print("%s\t%s" % (current_word, current_count))
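The reducer can also be exercised on its own with a few hand-written, pre-sorted key/value lines (the sample below is only an illustration):

printf "hello\t1\nhello\t1\nworld\t1\n" | python reducer_python.py
# prints:
# hello   2
# world   1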
3. Local test (screenshot of the result):
Original input file:
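Because Hadoop Streaming simply reads from stdin and writes to stdout, the whole job can be simulated locally with a shell pipeline. A minimal sketch, assuming the input file is named input.txt (a hypothetical name):

cat input.txt | python mapper_python.py | sort -k1,1 | python reducer_python.py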
4. Cluster test (screenshot of the result). Submit the job with Hadoop Streaming:
yarn jar <hadoop-streaming .jar> \
-files mapper_python.py,reducer_python.py \
-Dmapreduce.job.name="Python StreamingJob WordCount" \
-Dmapred.output.compress=false \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapred.reduce.tasks=1 \
-input /tmp/tianliangedu/input/whitespace_wordcount \
-output /tmp/tianliangedu/output100 \
-mapper "python mapper_python.py" \
-reducer "python reducer_python.py"
Finally, view the result with hdfs dfs -cat on the output file.
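With the single reducer configured above, the result lands in one part file under the output directory; a sketch of the check (the part-* name follows the default MapReduce output naming convention):

hdfs dfs -cat /tmp/tianliangedu/output100/part-*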