在大数据面试中,Hive知识的考察大部分会问级联求和,业务场景虽然有很多种,比如说,年收入,月收入统计;访客访问次数年统计,月统计。等等。但是基本根源知识是级联求和,本文就以访客访问统计为例。
1、基本需求
根据访客的每日访问信息,进行累计访问:
- 输入数据:
有如下访客访问次数统计表 t_access_times
为了减轻计算复杂度,去掉了天的信息只留下了年月。
- 输出数据:
需要输出报表:t_access_times_accumulate
2、实现步骤
2.1 原始数据
vi /home/chunsoft/t_access_times.dat
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
2.2 创建表
jdbc:hive2://172.16.208.128:10000>create table t_access_times(username string,month string,access_time int)
row format delimited fields terminated by ',';
2.3 将数据加载到表中
jdbc:hive2://172.16.208.128:10000>
load data local inpath '/home/chunsoft/t_access_times.dat' into table t_access_times;
2.4 计算单个用户的月访问次数
jdbc:hive2://172.16.208.128:10000>
select username,month,sum(access_time) as access_time from t_access_times group by username,month;
+-----------+----------+---------+--+
| username | month | access_time |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+
2.4 上面的表自己跟自己进行inner join方便求总访问
select A.*,B.* from
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) A
inner join
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) B
on
A.username=B.username
+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.access_time | b.username | b.month | b.access_time |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+
2.5 根据当前月份的大小来判断累计求和并排序
select A.username,A.month,max(A.access_time) access_time,sum(B.access_time) accumulate
from
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) A
inner join
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) B
on
A.username=B.username
where B.month<=A.month
group by A.username,A.month
order by A.username,A.month
+-------------+----------+---------+-------------+--+
| a.username | a.month | access_time | accumulate |
+-------------+----------+---------+-------------+--+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
这里面的max(A.access_time) 起到一个聚合作用,保留一条结果。