Hive窗口分析函数(案例详细讲解)
一、语法结构
- 语法结构:
分析函数 over(partition by 列名 order by 列名 rows between 开始位置 and 结束位置)
over()
函数中包括三个函数:包括分区partition by 列名
、排序order by 列名
、指定窗口范围rows between 开始位置 and 结束位置
。- 我们在使用over()窗口函数时,over()函数中的这三个函数可组合使用也可以不使用。
over()函数中如果不使用这三个函数,窗口大小是针对查询产生的所有数据,如果指定了分区,窗口大小是针对每个分区的数据。
二、窗口函数
-
① LEAD(col,n,DEFAULT)
- 用于统计窗口内往下第n行值
- 第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)。
-
② LAG(col,n,DEFAULT)
- 用于统计窗口内往上第n行值
- 第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
-
③ FIRST_VALUE
- 取分组内排序后,截止到当前行,第一个值
-
④ LAST_VALUE
- 取分组内排序后,截止到当前行,最后一个值
三、Over从句
-
1.使用标准的聚合函数
COUNT、SUM、MIN、MAX、AVG
-
2.使用
PARTITION BY
语句,使用一个或者多个原始数据类型的列 -
3.使用
PARTITION BY与ORDER BY
语句,使用一个或者多个数据类型的分区或者排序列 -
4.使用窗口规范,窗口规范支持以下格式:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
-
5.窗口范围说明:
我们常使用的窗口范围是
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(表示从起点到当前行),常用该窗口来计算累加。PRECEDING
:往前
FOLLOWING
:往后
CURRENT ROW
:当前行
UNBOUNDED
:起点(一般结合PRECEDING,FOLLOWING使用)
UNBOUNDED PRECEDING
表示该窗口最前面的行(起点)
UNBOUNDED FOLLOWING
:表示该窗口最后面的行(终点)比如说:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(表示从起点到当前行)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING
(表示往前2行到往后1行)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
(表示往前2行到当前行)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
(表示当前行到终点)当
ORDER BY
后面缺少窗口从句条件(即分析函数 over(partition by 列名 order by 列名)
),窗口规范默认是RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
当
ORDER BY
和窗口从句都缺失(即分析函数 over(partition by 列名)
), 窗口规范默认是ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
注意:
OVER
从句支持以下函数, 但是并不支持和窗口一起使用它们:
Ranking
函数:Rank, NTile, DenseRank, CumeDist, PercentRank.
Lead
和Lag
函数(即Ranking
函数 不能和Lead
、Lag
函数一起使用)
四、分析函数
-
①
RANK
:从1开始,按照顺序按照值排序时产生一个自增编号,值相等时会重复,会产生空位(如:1、2、3、3、3、6) -
②
ROW_NUMBER
:从1开始,按照顺序,按照值排序时产生一个自增编号,不会重复(如:1、2、3、4、5、6) -
③
DENSE_RANK
:从1开始,按照值排序时产生一个自增编号,值相等时会重复,不会产生空位(如:1、2、3、3、3、4) -
④
CUME_DIST
:小于等于当前值的行数/分组内总行数。比如,统计小于等于当前薪水的人数,所占总人数的比例 -
⑤
PERCENT_RANK
:分组内当前行的RANK值-1/分组内总行数-1 -
⑥
NTILE(n)
:用于将分组数据按照顺序切分成n片,返回当前切片值,如果切片不均匀,默认增加第一个切片的分布。NTILE不支持ROWS BETWEEN,比如NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
-
⑦
Distinct
: 去重。如COUNT(DISTINCT a) OVER (PARTITION BY c)
五、COUNT、SUM、MIN、MAX、AVG
本小节主要讲解COUNT、SUM、MIN、MAX、AVG
的用法。
1.数据准备
-------------------------------------------------------------------------
//订单表order : name, order_date, cost
Jan,2020-01-01,87
Jan,2020-01-02,95
Jan,2020-03-03,68
Jan,2020-05-01,68
Ben,2020-04-01,94
Ben,2020-01-02,56
Ben,2020-04-03,84
Ben,2020-05-01,84
Dan,2020-02-01,64
Dan,2020-03-02,86
Dan,2020-04-03,84
Dan,2020-04-01,84
Tim,2020-03-01,65
Tim,2020-02-02,85
Tim,2020-01-03,78
Tim,2020-04-01,78
Bob,2020-02-01,67
Bob,2020-03-02,95
Bob,2020-04-03,70
Bob,2020-05-01,70
------------------------------------------------------------------------
create table order_test(
name string,
order_date string,
cost int)
row format delimited fields terminated by ',';
-------------------------------------------------------------------------
load data local inpath '/tmp/order.txt' into table order;
2.sql示例
//COUNT、SUM、MIN、MAX、AVG
//以 sum函数举例
select
name,
order_date,
cost,
--① over():所有的数据求和。sum_01是一样的。
--求和范围:order_test表的所有数据
sum(cost) over() as sum_01,
--② over(partition by) :按照name分组,对分组相加。组内的sum_02是一个值。
--求和范围:以name分组,每个组内求和
sum(cost) over(partition by name) as sum_02,
--③ over(parition by order by):按照name分组,对分组按照时间升序累加。组内sum_03是一个变化的累加值
--求和范围:默认为从起点到当前行。以name分组,每个组内按order_date累计求和。注意和②的区别
sum(cost) over(partition by name order by order_date) as sum_03,
--④ between unbounded preceding and current row : 同③,从起点到当前行。(order by 省略窗口范围的默认范围)
sum(cost) over(partition by name order by order_date rows between unbounded preceding and current row) as sum_04,
--⑤ rows between n preceding and current row: 以name分组,当前行和前面n行做聚合。聚合的行数为: n+1(当前行)。 示例为 n=1的情况
sum(cost) over(partition by name order by order_date rows between 1 preceding and current row) as sum_05,
--⑥ rows between n1 preceding and n2 following:以name分组,当前行 + 前n1行 +后n2行做聚合。聚合的行数为:n1+n2+1(当前行) 示例n1 n2 =1
sum(cost) over(partition by name order by order_date rows between 1 preceding and 1 following) as sum_06,
--⑦rows between current row and unbounded following:以name分组,当前行+后面的所有行
sum(cost) over(partition by name order by order_date rows between current row and unbounded following) as sum_07
from order_test;
执行上面的sql:
select
name,
order_date,
cost,
sum(cost) over() as sum_01,
sum(cost) over(partition by name) as sum_02,
sum(cost) over(partition by name order by order_date) as sum_03,
sum(cost) over(partition by name order by order_date rows between unbounded preceding and current row) as sum_04,
sum(cost) over(partition by name order by order_date rows between 1 preceding and current row) as sum_05,
sum(cost) over(partition by name order by order_date rows between 1 preceding and 1 following) as sum_06,
sum(cost) over(partition by name order by order_date rows between current row and unbounded following) as sum_07
from order_test;
查询结果如下:
name order_date cost sum_01 sum_02 sum_03 sum_04 sum_05 sum_06 sum_07
Ben