现有这么一批数据,现要求出:
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
三个字段的意思:
用户名,月份,访问次数
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11
建表准备:
create table mianshi01(name string, month string, num int) row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/visits.txt" into table mianshi01;
select * from mianshi01;
desc mianshi01;
开启本地执行和笛卡尔积:
set hive.exec.mode.local.auto=true;
1)先求每个用户,每个月的访问次数,并存入中间表。
create table bak01 as select name,month, sum (num) from mianshi01 group by name,month order by name;
+-------------+--------------+------------+
| bak01.name | bak01.month | bak01.num |
+-------------+--------------+------------+
| A | 2015-03 | 38 |
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-02 | 15 |
| B | 2015-03 | 44 |
| B | 2015-01 | 30 |
+-------------+--------------+------------+
2)求到当前行的最大和累计(pv3,测试用)
select name ,month ,num ,
max(num) over(partition by name order by month desc) as pv2,
sum(num) over(partition by name order by month desc) as pv1,
max(num) over(partition by name) as pv3
from bak01;
这里需要注意over里面的order by 的作用:
没有排序意味着:聚合函数在整个分区范围中计算。
有排序意味着:聚合函数默认从起始行到当前行范围内计算。
以下结果中在每一组数据中,的pv3是不变的,而pv2每行不同。
+-------+----------+------+------+------+------+
| name | month | num | pv2 | pv1 | pv3 |
+-------+----------+------+------+------+------+
| A | 2015-01 | 33 | 38 | 81 | 38 |
| A | 2015-02 | 10 | 38 | 48 | 38 |
| A | 2015-03 | 38 | 38 | 38 | 38 |
| B | 2015-01 | 30 | 44 | 89 | 44 |
| B | 2015-02 | 15 | 44 | 59 | 44 |
| B | 2015-03 | 44 | 44 | 44 | 44 |
+-------+----------+------+------+------+------+