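This article uses Python with Hadoop Streaming to analyze user traffic data and report, for each user, the phone number, upstream traffic, downstream traffic, and total traffic.
This example is aimed at Hadoop beginners.
1. The data source to analyze
2. Implementing the basic functionality
3. Output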
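1. The data source to analyze
The input is a text file holding a large number of user browsing records. Each record includes the access timestamp, the user's phone number, the device identifier (MAC address), the IP address visited, the website visited, and the upstream and downstream traffic. The second column is the phone number, the third-from-last column is the upstream (upload) traffic, and the second-to-last column is the downstream (download) traffic. Total traffic is not stored in the file; the script in section 2 computes it as upstream plus downstream. That script splits each record on tab characters, so the real file is expected to be tab-delimited.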
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com ???? 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn ???? 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com ???? 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com ???? 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com ???? 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com ???? 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com ???? 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com ???? 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com ???? 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
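As a minimal sketch of the column layout described above, the following Python snippet pulls the phone number and the two traffic columns out of one record by position. The function name `parse_record` is hypothetical, and the sample row is the first record above re-typed with tab separators, on the assumption that the real file is tab-delimited.

```python
def parse_record(line):
    """Return (phone, up_flow, down_flow, total_flow) for one tab-separated record."""
    fields = line.rstrip("\n").split("\t")
    phone = fields[1]            # second column: phone number
    up_flow = int(fields[-3])    # third-from-last column: upstream traffic
    down_flow = int(fields[-2])  # second-to-last column: downstream traffic
    return phone, up_flow, down_flow, up_flow + down_flow


# First sample record, re-typed with tabs (assumed delimiter).
sample = "\t".join([
    "1363157985066", "13726230503", "00-FD-07-A4-72-B8:CMCC",
    "120.196.100.82", "i02.c.aliimg.com", "24", "27", "2481", "24681", "200",
])
print(parse_record(sample))  # ('13726230503', 2481, 24681, 27162)
```

Counting from the end of the row keeps the lookup correct whether or not the website and category columns are present.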
2. Implementing the basic functionality
The flow.py file
#!/usr/bin/python
# -*- coding:utf-8 -*-
"""
@author:Levy
@file:flow.py
@time:9/26/16 10:21 AM
"""
import sys
import gc
import logging

# Use the logging module to record skipped input lines in a log file.
logger = logging.getLogger()
logfile = 'flow.log'
hdlr = logging.FileHandler(logfile)
formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
hdlr.setFormatter(formatter)
logger.addHandler(hdlr)
logger.setLevel(logging.NOTSET)

# Turn off the cyclic garbage collector for this short-lived streaming script.
gc.disable()


def mapper():
    for line in sys.stdin:
        dataline = line.rstrip().split('\t')
        # Require at least 9 fields (the shortest rows in the sample);
        # longer rows carry the extra site and category columns.
        if len(dataline) < 9:
            logger.warning('skipping short line: %r', line)
            continue
        try:
            # Phone number, upstream traffic, downstream traffic, total traffic.
            phone = dataline[1]
            upFlow = int(dataline[-3])
            downFlow = int(dataline[-2])
            sumFlow = upFlow + downFlow
        except ValueError:
            logger.warning('skipping malformed line: %r', line)
            continue
        # Emit: phone, upstream, downstream, total (tab-separated, phone is the key).
        print('%s\t%s\t%s\t%s' % (phone, upFlow, downFlow, sumFlow))


def reducer():
    current_phone = None
    current_up = 0
    current_down = 0
    current_sum = 0
    for line in sys.stdin:
        fields = line.rstrip().split('\t')  # split the mapper output
        try:
            phone = fields[0]
            up, down, total = int(fields[1]), int(fields[2]), int(fields[3])
        except (IndexError, ValueError):
            logger.warning('skipping malformed line: %r', line)
            continue
        # Reducer input arrives sorted by key, so records for one phone are adjacent.
        if phone == current_phone:
            # Sum upstream, downstream, and total traffic separately.
            current_up += up
            current_down += down
            current_sum += total
        else:
            # A new phone begins: emit the totals of the previous one first.
            if current_phone is not None:
                print('%s,%s,%s,%s' % (current_phone, current_up, current_down, current_sum))
            current_phone = phone
            current_up, current_down, current_sum = up, down, total
    # Emit the totals for the last phone.
    if current_phone is not None:
        print('%s,%s,%s,%s' % (current_phone, current_up, current_down, current_sum))


# Dispatch on the first command-line argument: "mapper" or "reducer".
d = {'mapper': mapper, 'reducer': reducer}
if len(sys.argv) > 1 and sys.argv[1] in d:
    d[sys.argv[1]]()
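Before submitting to a cluster, the map, sort, and reduce stages can be imitated in a single local process. The sketch below is an illustration only and is not part of flow.py: `map_line`, `reduce_sorted`, and `run_local` are hypothetical helper names, the `sort` call stands in for the shuffle that Hadoop Streaming performs between the two stages, and the three input rows are sample records from section 1 re-typed with tab separators (assumed delimiter).

```python
from itertools import groupby


def map_line(line):
    """Mapper step: one raw record -> (phone, up, down, total), or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    try:
        up, down = int(fields[-3]), int(fields[-2])
        return fields[1], up, down, up + down
    except (IndexError, ValueError):
        return None


def reduce_sorted(records):
    """Reducer step: sum the traffic columns over runs of equal phone numbers."""
    for phone, group in groupby(records, key=lambda rec: rec[0]):
        rows = list(group)
        yield (phone,
               sum(r[1] for r in rows),
               sum(r[2] for r in rows),
               sum(r[3] for r in rows))


def run_local(lines):
    """Run map, sort-by-key, and reduce in one process."""
    mapped = [rec for rec in map(map_line, lines) if rec is not None]
    mapped.sort(key=lambda rec: rec[0])  # stands in for the Hadoop shuffle/sort
    return list(reduce_sorted(mapped))


rows = [
    "1363157993055\t13560439658\tC4-17-FE-BA-DE-D9:CMCC\t120.196.100.99\t18\t15\t1116\t954\t200",
    "1363157995052\t13826544101\t5C-0E-8B-C7-F1-E0:CMCC\t120.197.40.4\t4\t0\t264\t0\t200",
    "1363157992093\t13560439658\tC4-17-FE-BA-DE-D9:CMCC\t120.196.100.99\t15\t9\t918\t4938\t200",
]
for phone, up, down, total in run_local(rows):
    print("%s,%d,%d,%d" % (phone, up, down, total))
# 13560439658,2034,5892,7926
# 13826544101,264,0,264
```

The reducer only compares each key with the previous one, which is why the sort step matters: without it, the two records for 13560439658 would not be adjacent and would be reported as separate totals.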
The flow.sh file
#!/bin/bash
hdfs_input_path="/user/hdfs/flow/input"
hdfs_output_path="/user/hdfs/flow/output"
# Remove any previous output directory; the job fails if the output path already exists.
hdfs dfs -rm -r -f "${hdfs_output_path}"
# Map output and final output are both gzip-compressed; 5 reduce tasks are requested.
# The mapred.* property names are the older spellings, which Hadoop 2 still accepts.
hadoop jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.job.queuename=hdfs \
-D mapred.job.name='flow job' \
-D mapred.compress.map.output=true \
-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapred.map.tasks=10 \
-D mapred.reduce.tasks=5 \
-input "${hdfs_input_path}" \
-output "${hdfs_output_path}" \
-file flow.py \
-mapper "python flow.py mapper" \
-reducer "python flow.py reducer"
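3. Output
Each line is: phone number, upstream traffic, downstream traffic, total traffic. The totals are larger than sums over the rows shown in section 1, which suggests the job was run on a larger input file than that excerpt.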
13480253104,720,720,1440
13719199419,960,0,960
13726230503,4962,49362,54324
13922314466,12032,14880,26912
15989002119,7752,720,8472
13602846565,7752,11640,19392
13660577991,27840,2760,30600
13925057413,44232,192972,237204
13826544101,1056,0,1056
18211575961,6108,8424,14532
13502468823,29340,441396,470736
13560439658,8136,23568,31704
13760778710,480,480,960
13926435656,528,6048,6576
15013685858,14636,14152,28788
15920133257,12624,11744,24368
13560436666,7194,51270,58464
13726238888,9924,98724,108648
13926251106,960,0,960
18320173382,38124,9648,47772
84138413,16464,5728,22192
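Because flow.sh enables GzipCodec for the job output and requests five reduce tasks, the result arrives as several compressed part files rather than one sorted text file. The following is a minimal Python 3 sketch of merging them after the output directory has been copied to the local disk (for example with `hdfs dfs -get`). The helper name `load_results`, the local directory name `output`, and the `part-*.gz` filename pattern are assumptions and are not shown in the original post.

```python
import glob
import gzip
import os


def load_results(local_dir):
    """Read every gzip part file in local_dir and return rows sorted by total traffic."""
    results = []
    for path in sorted(glob.glob(os.path.join(local_dir, "part-*.gz"))):
        with gzip.open(path, "rt") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                phone, up, down, total = line.split(",")
                results.append((phone, int(up), int(down), int(total)))
    # Largest total traffic first.
    results.sort(key=lambda row: row[3], reverse=True)
    return results


if __name__ == "__main__":
    # "output" is a hypothetical local copy of the HDFS output directory.
    for row in load_results("output"):
        print("%s,%d,%d,%d" % row)
```

Sorting here is a local convenience: each reducer sees its own keys in sorted order, but nothing orders the rows across the five part files.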
References:
Java version: http://blog.csdn.net/sdksdk0/article/details/51628874
Streaming parameters: http://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html