Requirement:
From the access_log data, count the number of distinct identifiers seen before a given day, broken down by platform and channel.
Table structures for the corresponding HDFS files:
Source table
CREATE TABLE `bi_all_access_log`(
  `appsource` string,
  `appkey` string,
  `identifier` string,
  `uid` string)
PARTITIONED BY (
  `pt_month` string,
  `pt_day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'=',',
  'line.delim'='\n',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log'
TBLPROPERTIES (
  'transient_lastDdlTime'='1481872367')
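Per the SerDe properties above, each file row is a comma-delimited line of (appsource, appkey, identifier, uid); a hypothetical example row (values invented for illustration):
ios,10001,a1b2c3d4-device-id,10086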
Target table
CREATE TABLE `xxxxx`(
  `appsource` string,
  `appkey` string,
  `cnt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'=',',
  'line.delim'='\n',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hdfs-host:9000/nisj/mp_result'
TBLPROPERTIES (
  'last_modified_by'='hadoop',
  'last_modified_time'='1482325282',
  'numFiles'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1482325282')
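The target table then holds one comma-delimited line per (appsource, appkey) pair with its distinct-identifier count, e.g. (hypothetical values):
ios,10001,2351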
1. Mapper program
/Users/nisj/PycharmProjects/import_oss/src/final_remain_mapper.py
#!/usr/bin/env python
# encoding: utf-8
import sys

# Input comes from STDIN: comma-delimited rows of
# (appsource, appkey, identifier, uid).
ma_list = []
for line in sys.stdin:
    # Keep only the first three fields; uid (which carries the
    # trailing newline) is dropped.
    line = line.split(',')
    ma_list.append(line[0:3])
for ml in ma_list:
    print ml[0] + "," + ml[1] + "," + ml[2]
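A quick local check of the mapper (input values hypothetical):
echo "ios,10001,a1b2c3d4-device-id,10086" | python final_remain_mapper.py
# -> ios,10001,a1b2c3d4-device-id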
2. Reducer program
/Users/nisj/PycharmProjects/import_oss/src/final_remain_reducer.py
#!/usr/bin/env python
# encoding: utf-8
import sys
from itertools import groupby
from operator import itemgetter

# Input comes from STDIN: (appsource, appkey, identifier) rows.
# De-duplicate the rows with a set, then count distinct identifiers
# per (appsource, appkey).
ma_set = set()
for line in sys.stdin:
    ml = line.replace('\n', '').split(',')
    ma_set.add(tuple(ml))
ma_list = list(ma_set)
# groupby() only merges consecutive items, so sort by the group key first.
ma_list_sorted = sorted(ma_list, key=itemgetter(0, 1))
groupby_ma_list = groupby(ma_list_sorted, key=itemgetter(0, 1))
frl = []
for key, item in groupby_ma_list:
    item_cnt = 0
    for jtem in item:
        item_cnt += 1
    final_remain_list = (key, item_cnt)
    frl.append(final_remain_list)
for fl in frl:
    print str(fl[0][0]) + "," + str(fl[0][1]) + "," + str(fl[1])
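And of the reducer; two distinct identifiers under the same (appsource, appkey) should yield a count of 2 (values hypothetical):
printf "ios,10001,dev-aaa\nios,10001,dev-bbb\n" | python final_remain_reducer.py
# -> ios,10001,2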
3. Invoking the MapReduce job
3.1 Test mode (local pipes)
hadoop dfs -cat /user/hive/warehouse/bi_all_access_log/pt_month=2016-12/pt_day=2016-12-03/000000_0 | head -n 1000 | python final_remain_mapper.py | python final_remain_reducer.py
hadoop dfs -cat /user/hive/warehouse/bi_all_access_log/pt_month=2016-12/pt_day=2016-12-03/000000_0 | python final_remain_mapper.py | python final_remain_reducer.py
3.2 Formal (cluster) mode
hadoop dfs -rmr /nisj/mp_result
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/bi/final_remain_mapper.py -file /home/hadoop/bi/final_remain_mapper.py \
-reducer /home/hadoop/bi/final_remain_reducer.py -file /home/hadoop/bi/final_remain_reducer.py \
-input hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log/pt_month=2016-11/*/* \
-output /nisj/mp_result
-----------------------------------------------
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/bi/final_remain_mapper.py -file /home/hadoop/bi/final_remain_mapper.py \
-reducer /home/hadoop/bi/final_remain_reducer.py -file /home/hadoop/bi/final_remain_reducer.py \
-input hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log/pt_month=2016-12/pt_day=2016-12-03/000000_0 \
-output /nisj/mp_result \
-numReduceTasks 1
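Once the job completes, the result can be inspected in place; with -numReduceTasks 1, streaming writes a single part file:
hadoop dfs -cat /nisj/mp_result/part-00000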
4. MapReduce in a single script
/Users/nisj/PycharmProjects/import_oss/src/final_remain_mapper_all.py
#!/usr/bin/env python
# encoding: utf-8
import sys
from itertools import groupby
from operator import itemgetter

# Map and reduce in one pass: read raw comma-delimited rows from
# STDIN, keep (appsource, appkey, identifier), de-duplicate, then
# count distinct identifiers per (appsource, appkey).
ma_set = set()
for line in sys.stdin:
    line = line.split(',')
    ma_set.add(tuple(line[0:3]))
ma_list = list(ma_set)
# groupby() only merges consecutive items, so sort by the group key first.
ma_list_sorted = sorted(ma_list, key=itemgetter(0, 1))
groupby_ma_list = groupby(ma_list_sorted, key=itemgetter(0, 1))
frl = []
for key, item in groupby_ma_list:
    item_cnt = 0
    for jtem in item:
        item_cnt += 1
    final_remain_list = (key, item_cnt)
    frl.append(final_remain_list)
for fl in frl:
    print str(fl[0][0]) + "," + str(fl[0][1]) + "," + str(fl[1])
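Like the split mapper/reducer version, this script can be smoke-tested locally with a pipe before submitting:
hadoop dfs -cat /user/hive/warehouse/bi_all_access_log/pt_month=2016-12/pt_day=2016-12-03/000000_0 | python final_remain_mapper_all.py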
Invocation:
-- only works with a single mapper
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
-mapper /home/hadoop/bi/final_remain_mr.py -file /home/hadoop/bi/final_remain_mr.py \
-input hdfs://hdfs-host:9000/user/hive/warehouse/bi_all_access_log/*/*/* \
-output /nisj/mp_result
5. Verifying the results
The equivalent HQL in Hive (the inner GROUP BY de-duplicates identifiers, so the outer COUNT(*) is equivalent to COUNT(DISTINCT identifier)):
select appkey,appsource,count(*)
from (select appkey,appsource,identifier from bi_all_access_log group by appkey,appsource,identifier) x
group by appkey,appsource;
-------------------------------------
select appkey,appsource,count(*)
from (select appkey,appsource,identifier from bi_all_access_log where pt_day='2016-12-03' group by appkey,appsource,identifier) x
group by appkey,appsource;
Notes on the test results:
When the MapReduce scripts are tested on Linux with a cat pipe, the results match what the HQL returns in Hive. When the job is run through the Hadoop streaming jar, however, the output contains duplicate rows per group, and the number of duplicates equals the number of reducers: with the reducer count set to 1 the results are consistent and correct; with more than one reducer, a manual second aggregation pass is needed.
The exact cause is still unclear to me (a plausible explanation: with no tab in the mapper output, streaming treats each whole line as the key, so identifiers belonging to the same (appsource, appkey) group are scattered across reducers, each of which then emits its own partial count). In the next post I will write the MapReduce code a different way.
The single-script variant is correct under a cat pipe test, but inaccurate when invoked through the Hadoop streaming jar, presumably because the job launches one mapper per input split and each mapper counts only its own slice of the data (hence the "only works with a single mapper" caveat above). Let me think this over for a while.
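For the multi-reducer case, the manual re-aggregation mentioned above can be a small script that sums the partial counts per (appsource, appkey). The sketch below is mine, not part of the original job (the name merge_counts.py is hypothetical); summing should be safe here because identical (appsource, appkey, identifier) lines hash to the same reducer, so the partial counts do not overlap.
#!/usr/bin/env python
# encoding: utf-8
import sys

# Input lines look like "appsource,appkey,cnt" (the reducer output);
# sum the partial counts per (appsource, appkey).
totals = {}
for line in sys.stdin:
    appsource, appkey, cnt = line.replace('\n', '').split(',')
    totals[(appsource, appkey)] = totals.get((appsource, appkey), 0) + int(cnt)
for (appsource, appkey), cnt in totals.items():
    print appsource + "," + appkey + "," + str(cnt)
Usage, concatenating all part files into the script:
hadoop dfs -cat /nisj/mp_result/part-* | python merge_counts.py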