Writing and Invoking MapReduce Code in Python: Counting All Active Users Before a Given Day (1)

Requirement:
From the access_log data, count the number of distinct identifiers before a given day, broken down by platform and channel.

Table structures corresponding to the HDFS files:
Source table
CREATE TABLE `bi_all_access_log`(
  `appsource` string, 
  `appkey` string, 
  `identifier` string, 
  `uid` string)
PARTITIONED BY ( 
  `pt_month` string, 
  `pt_day` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'line.delim'='\n', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log'
TBLPROPERTIES (
  'transient_lastDdlTime'='1481872367')
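
For reference, each row of the source table is one comma-delimited line of (appsource, appkey, identifier, uid); the values below are made-up examples, not real data:

web,10001,868331022000001,5837102
ios,10001,A1B2C3D4-0001,5837103
web,10002,868331022000002,5837104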

Target table
CREATE TABLE `xxxxx`(
  `appsource` string, 
  `appkey` string, 
  `cnt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'line.delim'='\n', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hdfs-host:9000/nisj/mp_result'
TBLPROPERTIES (
  'last_modified_by'='hadoop', 
  'last_modified_time'='1482325282', 
  'numFiles'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1482325282')

1. The mapper program
/Users/nisj/PycharmProjects/import_oss/src/final_remain_mapper.py
#!/usr/bin/env python
# encoding: utf-8
import sys

# Input comes from STDIN (standard input): comma-delimited rows of
# (appsource, appkey, identifier, uid). Strip the trailing newline, then
# emit the first three columns; uid is not needed for the count.
for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) >= 3:
        print fields[0] + "," + fields[1] + "," + fields[2]
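
A quick local sanity check of the mapper (the input row is a made-up example):

echo "web,10001,868331022000001,5837102" | python final_remain_mapper.py
# expected output: web,10001,868331022000001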

2. The reducer program
/Users/nisj/PycharmProjects/import_oss/src/final_remain_reducer.py
#!/usr/bin/env python
# encoding: utf-8
import sys
from itertools import groupby
from operator import itemgetter

# Input comes from STDIN. Collect the (appsource, appkey, identifier)
# triples into a set to remove duplicates.
ma_set = set()
for line in sys.stdin:
    ma_set.add(tuple(line.strip().split(',')))

# groupby only merges adjacent items, so sort by (appsource, appkey) first.
ma_list_sorted = sorted(ma_set, key=itemgetter(0, 1))

# Each group's size is the number of distinct identifiers for that
# (appsource, appkey) pair.
for key, items in groupby(ma_list_sorted, key=itemgetter(0, 1)):
    cnt = sum(1 for _ in items)
    print key[0] + "," + key[1] + "," + str(cnt)
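
The reducer can be checked the same way; the duplicate triple on the second input line is removed by the set before counting:

printf "web,10001,id01\nweb,10001,id01\nweb,10001,id02\n" | python final_remain_reducer.py
# expected output: web,10001,2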

3. Invoking the MapReduce job
3.1 Test mode
hadoop dfs -cat /user/hive/warehouse/bi_all_access_log/pt_month=2016-12/pt_day=2016-12-03/000000_0 |head -n 1000 |python final_remain_mapper.py |python final_remain_reducer.py
hadoop dfs -cat /user/hive/warehouse/bi_all_access_log/pt_month=2016-12/pt_day=2016-12-03/000000_0 |python final_remain_mapper.py |python final_remain_reducer.py

3.2 Production mode
hadoop dfs -rmr /nisj/mp_result
hadoop jar  /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/bi/final_remain_mapper.py -file /home/hadoop/bi/final_remain_mapper.py \
-reducer /home/hadoop/bi/final_remain_reducer.py -file /home/hadoop/bi/final_remain_reducer.py \
-input hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log/pt_month=2016-11/*/* \
-output /nisj/mp_result
-----------------------------------------------
hadoop jar  /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/bi/final_remain_mapper.py -file /home/hadoop/bi/final_remain_mapper.py \
-reducer /home/hadoop/bi/final_remain_reducer.py -file /home/hadoop/bi/final_remain_reducer.py \
-input hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log/pt_month=2016-12/pt_day=2016-12-03/000000_0 \
-output /nisj/mp_result \
-numReduceTasks 1
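
Once the job finishes, the result can be inspected in place. With one reduce task, Streaming writes a single part file; the file name below assumes the default naming convention:

hadoop dfs -cat /nisj/mp_result/part-00000 | head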

4. Map and reduce in a single script
/Users/nisj/PycharmProjects/import_oss/src/final_remain_mapper_all.py
#!/usr/bin/env python
# encoding: utf-8
import sys
from itertools import groupby
from operator import itemgetter

# Input comes from STDIN (standard input). Map and reduce in one pass:
# deduplicate (appsource, appkey, identifier) triples, then count the
# distinct identifiers per (appsource, appkey) pair.
ma_set = set()
for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) >= 3:
        ma_set.add(tuple(fields[0:3]))

# groupby only merges adjacent items, so sort by (appsource, appkey) first.
ma_list_sorted = sorted(ma_set, key=itemgetter(0, 1))

for key, items in groupby(ma_list_sorted, key=itemgetter(0, 1)):
    cnt = sum(1 for _ in items)
    print key[0] + "," + key[1] + "," + str(cnt)

Invocation:
-- only a mapper, no reduce phase
hadoop jar  /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/bi/final_remain_mr.py -file /home/hadoop/bi/final_remain_mr.py \
-input hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log/*/*/* \
-output /nisj/mp_result
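
Since all of the deduplication and counting logic lives in the mapper, this variant is only correct when a single process reads the entire input. The Hadoop invocation above cannot guarantee that (the input is split across map tasks), but the pipe form can, at the cost of running on one machine:

hadoop dfs -cat /user/hive/warehouse/bi_all_access_log/*/*/* | python final_remain_mapper_all.py > final_remain_result.csv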

5. Verifying the results
The equivalent HQL query in Hive:
select appkey,appsource,count(*)
from (select appkey,appsource,identifier from bi_all_access_log group by appkey,appsource,identifier) x 
group by appkey,appsource;
-------------------------------------
select appkey,appsource,count(*)
from (select appkey,appsource,identifier from bi_all_access_log where pt_day='2016-12-03' group by appkey,appsource,identifier) x 
group by appkey,appsource;
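
Both forms are a distinct count in disguise; assuming nothing else about the table, the same numbers can be obtained directly:

select appkey,appsource,count(distinct identifier)
from bi_all_access_log
group by appkey,appsource;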

Notes on the test results:
Tested on Linux through the cat pipeline, the MapReduce code produces exactly the same results as the HQL run in Hive. Invoked through the Hadoop streaming jar, however, the output contains duplicated keys, and the number of duplicates matches the number of reducers. With the number of reduce tasks set to 1, the results are consistent and correct; with more than 1, a further manual aggregation is needed.
The exact cause is not yet clear to me (my guess: the mapper output contains no tab, so Streaming treats each whole line as the key, and records sharing the same (appsource, appkey) but carrying different identifiers are hashed to different reducers, each of which then emits its own partial count). The next post will write the MapReduce code in a different way.
For the single-script approach, the cat pipeline test is correct, but the Hadoop streaming jar invocation gives inaccurate results, presumably because each map task sees only its own input split and therefore deduplicates and counts only a fraction of the data. Allow me some time to think this through.
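
As a first guess at a fix, sketched below but not verified in this post: make the mapper emit tab-separated fields and tell Streaming to partition on the first two fields only. KeyFieldBasedPartitioner then routes every record for a given (appsource, appkey) to the same reducer, and the framework's sort delivers them grouped and ordered, so the reducer can deduplicate adjacent triples and count without holding everything in memory. The script names are hypothetical; the -D options are the standard Hadoop 2.x streaming ones.

/home/hadoop/bi/final_remain_mapper_kv.py (hypothetical):
#!/usr/bin/env python
# encoding: utf-8
import sys

# Emit appsource, appkey and identifier tab-separated; with
# stream.num.map.output.key.fields=3 all three fields form the sort key.
for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) >= 3:
        print '\t'.join(fields[0:3])

/home/hadoop/bi/final_remain_reducer_kv.py (hypothetical):
#!/usr/bin/env python
# encoding: utf-8
import sys

prev = None      # last (appsource, appkey, identifier) triple seen
cur_key = None   # current (appsource, appkey) group
cnt = 0
for line in sys.stdin:
    triple = tuple(line.strip().split('\t')[0:3])
    if triple == prev:              # same triple again: skip the duplicate
        continue
    prev = triple
    if triple[0:2] != cur_key:      # a new (appsource, appkey) group starts
        if cur_key is not None:
            print cur_key[0] + "," + cur_key[1] + "," + str(cnt)
        cur_key = triple[0:2]
        cnt = 0
    cnt += 1                        # one more distinct identifier
if cur_key is not None:
    print cur_key[0] + "," + cur_key[1] + "," + str(cnt)

Invocation (the -D generic options must precede the streaming options):
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-D stream.num.map.output.key.fields=3 \
-D mapreduce.partition.keypartitioner.options=-k1,2 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper /home/hadoop/bi/final_remain_mapper_kv.py -file /home/hadoop/bi/final_remain_mapper_kv.py \
-reducer /home/hadoop/bi/final_remain_reducer_kv.py -file /home/hadoop/bi/final_remain_reducer_kv.py \
-input hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log/pt_month=2016-11/*/* \
-output /nisj/mp_result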
