Goal: find the two hottest days in each month.
"""
-------------------------------
FileName: mapper
Author: Tgw
Date: 19-07-28
-------------------------------
Change Activity: 19-07-28
Data format:
2010-10-01 14:21:02 34C
2009-10-05 15:21:02 28C
2002-03-12 14:21:02 30C
2011-06-22 15:21:02 26C
2015-08-10 14:21:02 36C
Need: find the two hottest days in each month
"""
#!/usr/bin/env python
import sys

# Read "date time temperature" records from stdin and emit
# "date<TAB>temperature", hottest first. The trailing 'C' is stripped
# so temperatures compare numerically, not lexicographically.
list_for_sort = []
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    date, time, temperature = line.split()
    list_for_sort.append((date, temperature))
list_for_sort.sort(key=lambda x: int(x[1].rstrip('C')), reverse=True)
for record in list_for_sort:
    print("%s\t%s" % record)
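The mapper can be sanity-checked without Hadoop by running its logic on the documented sample records; a minimal sketch, where the `records` list stands in for stdin:

```python
# Simulate the mapper on the documented sample records.
records = [
    "2010-10-01 14:21:02 34C",
    "2009-10-05 15:21:02 28C",
    "2002-03-12 14:21:02 30C",
    "2011-06-22 15:21:02 26C",
    "2015-08-10 14:21:02 36C",
]
pairs = []
for line in records:
    date, time, temperature = line.split()
    pairs.append((date, temperature))
# Sort numerically on the temperature, hottest first.
pairs.sort(key=lambda x: int(x[1].rstrip("C")), reverse=True)
for date, temperature in pairs:
    print("%s\t%s" % (date, temperature))
# The hottest day, 2015-08-10 36C, comes out first.
```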
"""
-------------------------------
FileName: reducer
Author: Tgw
Date: 19-07-28
-------------------------------
Change Activity: 19-07-28
"""
#!/usr/bin/env python
import sys

# Keep at most two dates per (year, month). This assumes the input is
# grouped by month and sorted by temperature descending within each
# month, so the first two records of a month are its two hottest days.
word_dir = {}
month_limit_num = 2
month_limit_str = ()
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    date, temperature = line.split('\t')
    year, month, day = date.split('-')
    if (year, month) != month_limit_str:
        month_limit_num = 2  # new month: reset the quota
    if month_limit_num > 0 and date not in word_dir:
        word_dir[date] = temperature
        month_limit_num -= 1
    month_limit_str = (year, month)
for k, v in word_dir.items():
    print('%s\t%s' % (k, v))
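The per-month quota logic can be exercised directly on a few lines. The two extra October 2010 records below are made up for illustration, and the input is already grouped by month with temperatures descending, as the reducer assumes:

```python
# Mapper-style "date<TAB>temperature" lines, grouped by (year, month)
# and hottest-first within each month.
lines = [
    "2010-10-03\t37C",  # hypothetical extra record
    "2010-10-01\t34C",
    "2010-10-12\t30C",  # hypothetical; third in its month, so dropped
    "2015-08-10\t36C",
]
top_days = {}
quota = 2
current_month = ()
for line in lines:
    date, temperature = line.split("\t")
    year, month, day = date.split("-")
    if (year, month) != current_month:
        quota = 2  # new month: reset the quota
    if quota > 0 and date not in top_days:
        top_days[date] = temperature
        quota -= 1
    current_month = (year, month)
for k, v in top_days.items():
    print("%s\t%s" % (k, v))
# 2010-10-12 is not emitted: its month's quota was already spent.
```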
Launcher script run:
#!/bin/bash
HADOOP_CMD="/home/wcc/Myapp/hadoop-2.6.5/bin/hadoop"
STREAM_JAR_PATH="/home/wcc/Myapp/hadoop-2.6.5/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"
INPUT_FILE_PATH_1="/user/root/input/tq.txt"          # input path
OUTPUT_PATH="/user/root/output/high_temperature"     # output path

# Remove the output path before each run; the job fails if the
# directory already exists.
$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH

# Note: the streaming shuffle re-sorts map output by its key (the full
# date), so on the cluster a month's records reach the reducer in date
# order, not hottest-first. Correct cluster results may need a
# KeyFieldBasedComparator configuration, or an order-independent reducer.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file ./mapper.py \
    -file ./reducer.py
Local test (mapper.py and reducer.py must be executable). The reducer expects each month's lines hottest-first, so group on the year-month prefix with a stable sort, which preserves the mapper's temperature ordering:
hdfs dfs -cat /user/root/input/tq.txt | ./mapper.py | sort -s -t$'\t' -k1.1,1.7 | ./reducer.py
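The pipeline's intent — group records by month while keeping each month hottest-first, then take the first two dates per month — can also be simulated end to end in pure Python (the two extra October 2010 records are made up for illustration):

```python
# End-to-end simulation of mapper | sort | reducer.
raw = [
    "2010-10-01 14:21:02 34C",
    "2010-10-03 14:21:02 37C",  # hypothetical extra record
    "2010-10-12 14:21:02 30C",  # hypothetical extra record
    "2015-08-10 14:21:02 36C",
]
# Mapper stage: (date, temperature) pairs, hottest first.
pairs = [(line.split()[0], line.split()[2]) for line in raw]
pairs.sort(key=lambda x: int(x[1].rstrip("C")), reverse=True)
# Grouping sort: Python's sort is stable, so sorting on the year-month
# prefix alone keeps the hottest-first order within each month.
pairs.sort(key=lambda x: x[0][:7])
# Reducer stage: the first two dates of each month are its hottest two.
top, per_month = {}, {}
for date, temp in pairs:
    month = date[:7]
    if per_month.get(month, 0) < 2:
        top[date] = temp
        per_month[month] = per_month.get(month, 0) + 1
for date in sorted(top):
    print("%s\t%s" % (date, top[date]))
# 2010-10-12, the month's third-hottest day, is dropped.
```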
To run the job on the cluster:
- first make run executable:
chmod +x run
- then launch the script:
. run