hive通过range between 开窗时间滑动窗口
问题
之前,有个需求,有点类似spark的开窗函数实现的东西。不知道在hive上怎么实现。
百度了一下,没有发现好的实现方法,有一个是通过join来实现的,不是很优雅。自己通过开窗函数也出来一个,总感觉也不是很完美。
解决
今天深入学习了一下range between 和rows between的区别以后,突然发现range between就可以完美的实现这个功能。
贴出来交流一下,供大家参考使用。
回忆一下range between 和 rows between的用法
create table tmp.tmp_test_range_rows(
id int,
date_day date
) stored as orc
;
insert into tmp.tmp_test_range_rows (id, date_day)
values (1, '2023-01-01')
, (1, '2023-01-03')
, (3, '2023-01-05')
, (6, '2023-01-07')
, (6, '2023-01-08')
, (6, '2023-01-09')
, (7, '2023-01-10')
, (8, '2023-01-12')
, (9, '2023-01-15')
;
select id,sum(id) over(order by id) as sum_id
, sum(id) over(order by id range between unbounded preceding and current row ) as sum_range
, sum(id) over(order by id range between 1 preceding and 2 following ) as sum_range2
, sum(id) over(order by id rows between 1 preceding and 2 following ) as sum_rows2
from tmp.tmp_test_range_rows a
;
id | sum_id | sum_range | sum_range2 | sum_rows2 |
---|---|---|---|---|
1 | 2 | 2 | 5 | 5 |
1 | 2 | 2 | 5 | 11 |
3 | 5 | 5 | 3 | 16 |
6 | 23 | 23 | 33 | 21 |
6 | 23 | 23 | 33 | 25 |
6 | 23 | 23 | 33 | 27 |
7 | 30 | 30 | 42 | 30 |
8 | 38 | 38 | 24 | 24 |
9 | 47 | 47 | 17 | 17 |
分析
- 当order by后面的rows/range缺失时,默认是range between
- 关于sum_range2 分析,order by id,所以要看id的值,并对id进行range操作,即[id-1, id+2]
hive 开窗实现滑动窗口
有了上面的经验,我们可以对日期range,然后开窗滑动
例如,我们开个近7天的窗口
select id, date_day
, sum(id) over(order by date_day range between 6 preceding and current row ) as sum_range2
, sum(id) over(order by date_day rows between 6 preceding and current row ) as sum_rows2
from tmp.tmp_test_range_rows a
;
id | date_day | sum_range2 | sum_rows2 |
---|---|---|---|
1 | 2023-01-01 | 1 | 1 |
1 | 2023-01-03 | 2 | 2 |
3 | 2023-01-05 | 5 | 5 |
6 | 2023-01-07 | 11 | 11 |
6 | 2023-01-08 | 16 | 17 |
6 | 2023-01-09 | 22 | 23 |
7 | 2023-01-10 | 28 | 30 |
8 | 2023-01-12 | 33 | 37 |
9 | 2023-01-15 | 30 | 45 |
类似实现了滑动窗口的效果
月的和小时的可以使用类似的技术来实现
交流
欢迎大家留言交流hql的花样写法。