HIVE SQL 聚合函数与 rows between / range between详解

一、rows between 与 range between 用法

1. 相关关键词解析

unbounded 无边界
preceding 往前
following 往后
unbounded preceding 往前所有行,即初始行
n preceding 往前n行
unbounded following 往后所有行,即末尾行
n following 往后n行
current row 当前行

语法
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

2. rows between ... and ...

rows:指以行号来决定frame的范围,是物理意义上的行。

比如rows between 1 preceding and 1 following代表从当前行往前一行以及往后一行。

3. range between ... and ...

range:指以当前行在开窗函数中的值为根基,然后按照order by进行排序,最后根据range去加减上下界。是逻辑意义上的行。

比如sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) 表示按照id分组,按照score升序排序,然后以当前行的score,下界减一,上界加一,作为范围,将这范围里的score进行加总。

讲的比较拗口,下面看个例子就懂了。

二、举例

1. 数据准备

假设有表datadev.t_student,数据如下

idscore
stu_11
stu_12
stu_13
stu_14
stu_15
stu_15

2. 测试 rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a3,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a4,
sum(score) over (PARTITION by id order by 1) as a5
from datadev.t_student;

测试结果如下:

分析:

  1. sum(score) over (PARTITION by id) as a1:按照id分组直接加总score,这种大家最熟悉了
  2. sum(score) over (PARTITION by id order by score) as a2:按照score排序,从起始行到当前行进行加总,与a3中的ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW不同的是,当score相同时,算相同排名,会一起加总。类似rank的概念。
  3. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):从起始行到当前行进行加总,与a2不同的是,当score相同时,排名不同,不会加总到当前行。类似row_number的概念。
  4. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):从起始行到末尾行进行加总,与a1相同。
  5. sum(score) over (PARTITION by id order by 1):作用与a2一样,这里order by 1,相当于score相同,因此全部加总。

a1与a2在官网上的解释如下:

  • When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
  • When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

因此,a1与a2等价于

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a1,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a2,
from datadev.t_student;

 官网地址如下:

LanguageManual WindowingAndAnalytics - Apache Hive - Apache Software Foundation

3. 测试 range between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as b1,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as b2
from datadev.t_student;

测试结果如下:

 分析:

  1. sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 是默认值,可不写。
  2. sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING):按照id分组,score升序排序,并将当前行的score下界减一,上界为全部(可认为是无穷大),作为筛选范围。最后将符合筛选范围的score进行相加。

b2运算过程分析如下:

idscore运算过程运算b2
stu_11[当前行的score值 - 1,∞] ==> 即[0, ∞]1+2+3+4+5+5=2020
stu_12[当前行的score值 - 1,∞] ==> 即[1, ∞]1+2+3+4+5+5=2020
stu_13[当前行的score值 - 1,∞] ==> 即[2, ∞]2+3+4+5+5=1919
stu_14[当前行的score值 - 1,∞] ==> 即[3, ∞]3+4+5+5=1717
stu_15[当前行的score值 - 1,∞] ==> 即[4, ∞]4+5+5=1414
stu_15[当前行的score值 - 1,∞] ==> 即[4, ∞]4+5+5=1414

4. 对比 range between ... and ... 与 rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) as a,
sum(score) over (PARTITION by id order by score ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as b
from datadev.t_student;

测试结果如下:

 分析:

idscorerange运算过程range运算a(range)rows运算过程rows运算b(rows)
stu_11[当前行的score值 - 1,当前行的score值 + 1] ==> 即[0, 2]1+2=33将当前行的上一行以及下一行
的score进行相加
1+2=33
stu_12[当前行的score值 - 1,当前行的score值 + 1] ==> 即[1, 3]1+2+3=66将当前行的上一行以及下一行
的score进行相加
1+2+3=66
stu_13[当前行的score值 - 1,当前行的score值 + 1] ==> 即[2, 4]2+3+4=99将当前行的上一行以及下一行
的score进行相加
2+3+4=99
stu_14[当前行的score值 - 1,当前行的score值 + 1] ==> 即[3, 5]3+4+5+5=1717将当前行的上一行以及下一行
的score进行相加
3+4+5=1212
stu_15[当前行的score值 - 1,当前行的score值 + 1] ==> 即[4, 6]4+5+5=1414将当前行的上一行以及下一行
的score进行相加
4+5+5=1414
stu_15[当前行的score值 - 1,当前行的score值 + 1] ==> 即[4, 6]4+5+5=1414将当前行的上一行以及下一行
的score进行相加
5+5=1010

参考文档:Hive 窗口与分析型函数

  • 40
    点赞
  • 118
    收藏
    觉得还不错? 一键收藏
  • 2
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值