Goal: find the two hottest days in each month.
"""
-------------------------------
FileName: mapper
Author: Tgw
Date: 19-07-28
-------------------------------
Change Activity: 19-07-28
Data format:
2010-10-01 14:21:02 34C
2009-10-05 15:21:02 28C
2002-03-12 14:21:02 30C
2011-06-22 15:21:02 26C
2015-08-10 14:21:02 36C
Need: find the two hottest days in each month
"""
#!/usr/bin/env python
import sys

# Read "date time temperature" records from stdin and emit
# "date<TAB>temperature", hottest first. The trailing 'C' is stripped
# so temperatures compare numerically, not lexicographically.
list_for_sort = []
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    date, time, temperature = line.split()
    list_for_sort.append((date, temperature))
list_for_sort.sort(key=lambda x: int(x[1].rstrip('C')), reverse=True)
for record in list_for_sort:
    print("%s\t%s" % record)
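The mapper can be sanity-checked without Hadoop by running its logic on the documented sample records; a minimal sketch, where the `records` list stands in for stdin:

```python
# Simulate the mapper on the documented sample records.
records = [
    "2010-10-01 14:21:02 34C",
    "2009-10-05 15:21:02 28C",
    "2002-03-12 14:21:02 30C",
    "2011-06-22 15:21:02 26C",
    "2015-08-10 14:21:02 36C",
]
pairs = []
for line in records:
    date, time, temperature = line.split()
    pairs.append((date, temperature))
# Sort numerically on the temperature, hottest first.
pairs.sort(key=lambda x: int(x[1].rstrip("C")), reverse=True)
for date, temperature in pairs:
    print("%s\t%s" % (date, temperature))
# The hottest day, 2015-08-10 36C, comes out first.
```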
"""
-------------------------------
FileName: reducer
Author: Tgw
Date: 19-07-28
-------------------------------
Change Activity: 19-07-28
"""
#!/usr/bin/env python
import sys

# Keep at most two dates per (year, month). This assumes the input is
# grouped by month and sorted by temperature descending within each
# month, so the first two records of a month are its two hottest days.
word_dir = {}
month_limit_num = 2
month_limit_str = ()
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    date, temperature = line.split('\t')
    year, month, day = date.split('-')
    if (year, month) != month_limit_str:
        month_limit_num = 2  # new month: reset the quota
    if month_limit_num > 0 and date not in word_dir:
        word_dir[date] = temperature
        month_limit_num -= 1
    month_limit_str = (year, month)
for k, v in word_dir.items():
    print('%s\t%s' % (k, v))
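The per-month quota logic can be exercised directly on a few lines. The two extra October 2010 records below are made up for illustration, and the input is already grouped by month with temperatures descending, as the reducer assumes:

```python
# Mapper-style "date<TAB>temperature" lines, grouped by (year, month)
# and hottest-first within each month.
lines = [
    "2010-10-03\t37C",  # hypothetical extra record
    "2010-10-01\t34C",
    "2010-10-12\t30C",  # hypothetical; third in its month, so dropped
    "2015-08-10\t36C",
]
top_days = {}
quota = 2
current_month = ()
for line in lines:
    date, temperature = line.split("\t")
    year, month, day = date.split("-")
    if (year, month) != current_month:
        quota = 2  # new month: reset the quota
    if quota > 0 and date not in top_days:
        top_days[date] = temperature
        quota -= 1
    current_month = (year, month)
for k, v in top_days.items():
    print("%s\t%s" % (k, v))
# 2010-10-12 is not emitted: its month's quota was already spent.
```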
Launcher script run:
#!/bin/bash
HADOOP_CMD="/home/wcc/Myapp/hadoop-2.6.5/bin/hadoop"
STREAM_JAR_PATH="/home/wcc/Myapp/hadoop-2.6.5/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"
INPUT_FILE_PATH_1="/user/root/input/tq.txt"          # input path
OUTPUT_PATH="/user/root/output/high_temperature"     # output path

# Remove the output path before each run; the job fails if the
# directory already exists.
$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH

# Note: the streaming shuffle re-sorts map output by its key (the full
# date), so on the cluster a month's records reach the reducer in date
# order, not hottest-first. Correct cluster results may need a
# KeyFieldBasedComparator configuration, or an order-independent reducer.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file ./mapper.py \
    -file ./reducer.py
Local test (mapper.py and reducer.py must be executable). The reducer expects each month's lines hottest-first, so group on the year-month prefix with a stable sort, which preserves the mapper's temperature ordering:
hdfs dfs -cat /user/root/input/tq.txt | ./mapper.py | sort -s -t$'\t' -k1.1,1.7 | ./reducer.py
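The pipeline's intent — group records by month while keeping each month hottest-first, then take the first two dates per month — can also be simulated end to end in pure Python (the two extra October 2010 records are made up for illustration):

```python
# End-to-end simulation of mapper | sort | reducer.
raw = [
    "2010-10-01 14:21:02 34C",
    "2010-10-03 14:21:02 37C",  # hypothetical extra record
    "2010-10-12 14:21:02 30C",  # hypothetical extra record
    "2015-08-10 14:21:02 36C",
]
# Mapper stage: (date, temperature) pairs, hottest first.
pairs = [(line.split()[0], line.split()[2]) for line in raw]
pairs.sort(key=lambda x: int(x[1].rstrip("C")), reverse=True)
# Grouping sort: Python's sort is stable, so sorting on the year-month
# prefix alone keeps the hottest-first order within each month.
pairs.sort(key=lambda x: x[0][:7])
# Reducer stage: the first two dates of each month are its hottest two.
top, per_month = {}, {}
for date, temp in pairs:
    month = date[:7]
    if per_month.get(month, 0) < 2:
        top[date] = temp
        per_month[month] = per_month.get(month, 0) + 1
for date in sorted(top):
    print("%s\t%s" % (date, top[date]))
# 2010-10-12, the month's third-hottest day, is dropped.
```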
To run the job on the cluster:
- first make run executable:
chmod +x run
- then launch the script:
. run