连续级联累计查询
- 基于Hive。同样在DB2,Oracle,MySQL8.0都可以适配
需求:
- 统计:某一用户在一天中所出现过的位置,以及在此位置上停留的时长。原始数据是以小时的粒度划分的,现在要将连续多个小时出现在某一位置的数据汇总到此连续时间段的首次出现的的此位置对应的那一个小时,将同一位置连续停留的数据汇总成一条。
- 格式:原始数据数据如下,效果数据看最后一张图。
原始数据 input.csv
UserA,LocationA,2018-01-01 08:00:00,60
UserA,LocationA,2018-01-01 09:00:00,60
UserA,LocationB,2018-01-01 10:00:00,60
UserA,LocationA,2018-01-01 11:00:00,60
UserA,LocationC,2018-01-01 12:00:00,60
UserA,LocationC,2018-01-01 13:00:00,60
UserA,LocationC,2018-01-01 14:00:00,60
UserA,LocationA,2018-01-01 15:00:00,60
cd /opt
mkdir my_test
touch input.csv
vi input.csv
create table `test_input`(
user_id string,
location_id string,
start_time string,
duration string
)
row format delimited fields terminated by ',';
load data local inpath '/opt/my_test/input.csv' into table test_input;
select tt.*,rk-dr from (
select t.*,dense_rank() over(partition by location_id order by start_time) dr,
row_number() over(order by start_time) rk
from test_input t order by start_time
) tt;
- 结果
- 第二步(最关键,需要强有力的数学逻辑支撑)
select ttt.*,row_number() over(partition by ttt.location_id,md order by ttt.start_time desc) new_rk from (
select tt.*,rk-dr md from (
select t.*,dense_rank() over(partition by location_id order by start_time) dr,
row_number() over(order by start_time) rk
from test_input t order by start_time
) tt) ttt order by start_time;
- 结果
- 第三步(进一步加工,实现结果)
select user_id,location_id,start_time,duration*new_mrk from (
select tttt.*,case when new_rk=max(new_rk) over(partition by location_id,md) then new_rk else '0' end new_mrk from (
select ttt.*,row_number() over(partition by ttt.location_id,md order by ttt.start_time desc) new_rk from (
select tt.*,rk-dr md from (
select t.*,dense_rank() over(partition by location_id order by start_time) dr,
row_number() over(order by start_time) rk
from test_input t order by start_time
) tt) ttt)tttt)ttttt
where new_mrk!='0'
order by start_time;
- 最终结果
备注:第二步的转换,绝对正确。有不理解的评论区留言