A simple MapReduce example with Python and Hadoop Streaming


Requirement: find the two hottest days of each month.

mapper.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
-------------------------------
FileName: mapper
Author: Tgw
Date: 19-07-28
-------------------------------
Change Activity: 19-07-28

Data format:
2010-10-01 14:21:02 34C
2009-10-05 15:21:02 28C
2002-03-12 14:21:02 30C
2011-06-22 15:21:02 26C
2015-08-10 14:21:02 36C

Need: 得到每个月天气最高的两天

"""
import sys

list_for_sort = []

for line in sys.stdin:
    # Read the input stream line by line
    line = line.strip()
    if not line:
        continue
    date, time, temperature = line.split()

    list_for_sort.append((date, temperature))

# Sort hottest-first. Strip the trailing "C" and compare numerically --
# a plain string comparison would rank "9C" above "34C".
list_for_sort = sorted(list_for_sort, key=lambda x: int(x[1].rstrip('C')), reverse=True)

for date, temperature in list_for_sort:
    print("%s\t%s" % (date, temperature))

reducer.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
-------------------------------
FileName: reducer
Author: Tgw
Date: 19-07-28
-------------------------------
Change Activity: 19-07-28
"""
import sys

word_dir = {}
month_limit_num = 2   # at most two days kept per month
month_limit_str = ()  # (year, month) of the group currently being filled

for line in sys.stdin:

    line = line.strip()
    if not line:
        continue
    date, temperature = line.split('\t')

    year, month, day = date.split('-')

    if (year, month) != month_limit_str:  # a new (year, month) group begins
        month_limit_num = 2               # reset the per-month quota

    if month_limit_num > 0 and date not in word_dir:
        word_dir.setdefault(date, temperature)
        month_limit_num -= 1
        month_limit_str = (year, month)


for k, v in word_dir.items():
    print('%s\t%s' % (k,v))
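To see the reducer's selection rule in isolation, here is the same loop run on a small in-memory sample (hypothetical data) already ordered the way the reducer expects: grouped by year-month with the hottest day first. Only the first two distinct dates of each month survive.

```python
# The reducer's per-month quota, applied to a pre-sorted sample:
# within each (year, month) group the hottest lines come first.
lines = [
    "2010-10-01\t34C",
    "2010-10-07\t30C",
    "2010-10-03\t28C",   # third-hottest day of 2010-10: dropped
    "2011-06-22\t26C",
]

word_dir = {}
month_limit_num = 2
month_limit_str = ()

for line in lines:
    date, temperature = line.split("\t")
    year, month, day = date.split("-")
    if (year, month) != month_limit_str:   # a new month begins
        month_limit_num = 2                # reset the quota
    if month_limit_num > 0 and date not in word_dir:
        word_dir[date] = temperature
        month_limit_num -= 1
        month_limit_str = (year, month)

print(word_dir)
# {'2010-10-01': '34C', '2010-10-07': '30C', '2011-06-22': '26C'}
```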


The launcher script, run:

HADOOP_CMD="/home/wcc/Myapp/hadoop-2.6.5/bin/hadoop"
STREAM_JAR_PATH="/home/wcc/Myapp/hadoop-2.6.5/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"

INPUT_FILE_PATH_1="/user/root/input/tq.txt"       # input path
OUTPUT_PATH="/user/root/output/high_temperature"  # output path
$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH     # remove the output path before each run, or the job fails

$HADOOP_CMD jar $STREAM_JAR_PATH \
                -input $INPUT_FILE_PATH_1 \
                -output $OUTPUT_PATH \
                -mapper "python mapper.py" \
                -reducer "python reducer.py" \
                -file ./mapper.py \
                -file ./reducer.py

Local test (the sort step stands in for the shuffle, grouping lines by year-month with the hottest days first, which is the order the reducer expects):

hdfs dfs -cat /user/root/input/tq.txt | python mapper.py | sort -t$'\t' -k1.1,1.7 -k2,2nr | python reducer.py

Running on the cluster:

  1. Make run executable:
chmod +x run
  2. Launch the script:
./run