可参考博客: https://zhuanlan.zhihu.com/p/113245904
窗口函数
概述
- 窗口函数是一组特殊函数
扫描多个输入行来计算每个输出值,为每行数据生成一行结果
可以通过窗口函数来实现复杂的计算和聚合 - 语法
function (arg1,..., arg n) over ([partition by <...>] [order by <....>] [<window_clause>])
-
- partition by类似于group by,未指定则按整个结果集
-
- 只有指定group by子句之后才能进行窗口定义
-
- 可同时使用多个窗口函数
-
- 过滤窗口函数计算结果必须在外面一层
- 按功能可划分为:排序,聚合,分析
案例:
create database wintest;
use wintest;
create table test(name string,buydate string,num int)
row format delimited
fields terminated by ',';
select * from test;
select *,count(*) over() from test;
select *,count(*) over(partition by name) from test;
select *,count(*) over(partition by name order by num) from test;
select *,count(*) over(partition by name order by buydate desc) from test;
select * from (select *,count(*) over(partition by name order by buydate desc) win1 from test) a where a.win1<3;
select *,count(1) over() from test where buydate like "2018-01-%";
排序
- row_number()
对所有数值输出不同的序号,序号唯一连续
select *,row_number() over() from test;
select *,row_number() over(partition by name) from test;
- rank()
对相同数值,输出相同的序号,下一个序号跳过(1,1,3)
select * ,rank() over() win1 from test;
select * ,rank() over(partition by name) win1 from test;
select * ,rank() over(partition by name order by buydate) win1 from test;
- dense_rank()
对相同数值,输出相同的序号,下一个序号连续(1,1,2)
select * ,dense_rank() over(partition by name order by buydate) win1 from test;
- ntile(n)
用于对分组数据按照顺序切片分成n片,返回当前切片值
select *,ntile(4) over(partition by name) from test;
- percent_rank()
(目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名
聚合
- count()
计数,可以和distinct一起用 - sum():求和
- avg():平均值
- max()/min(): 最大/小值
从Hive 2.1.0开始在over子句中支持聚合函数
分析
- cume_dist
小于等于当前值的行数/分组内总行数
select *,cume_dist() over(partition by name order by buydate) from test;
- lead/lag(col,n)
某一列进行往前/后取第n行值(n可选,默认为1)
select *,lag(buydate) over(partition by name order by buydate) from test;
select *,lag(buydate) over(partition by name order by buydate) lastdate,lag(num) over(partition by name order by buydate) from test;
- first_value
对该列到目前为止的首个值
select *,lag(buydate) over(partition by name order by buydate) lastdate,lag(num) over(partition by name order by buydate),first_value(num) over(partition by name order by buydate) from test;
- last_value
到目前行为止的最后一个值
select *,lag(buydate) over(partition by name order by buydate) lastdate,lag(num) over(partition by name order by buydate),last_value(num) over(partition by name order by buydate) from test;
窗口定义
-
窗口定义由[<window_clause>]子句描述
用于进一步细分结果并应用分析函数 -
支持两类窗口定义
行类型窗口
范围类型窗口 -
rank、ntile、dense_rank、cume_dist、percent_rank、lead、lag和row_number函数不支持与窗口子句一起使用
-
行窗口:根据当前行之前或之后的行号确定的窗口
rows between <start_expr> and <end_expr> -
<start_expr>可以为下列值
unbounded preceding: 窗口起始位置(分组第一行)
current row:当前行
n preceding/following:当前行之前/之后n行 -
<end_expr>可以为下列值
unbounded following : 窗口结束位置(分组最后一行)
current row:当前行
n preceding/following:当前行之前/之后n行
案例:
select *,count(1) over(partition by name order by buydate desc rows between unbounded preceding and current row ) from test;
select *,count(1) over(partition by name order by buydate desc rows between 1 preceding and current row) from test;
select *,count(1) over(partition by name order by buydate desc rows between 1 preceding and 1 following) from test;
- 范围窗口是取分组内的值在指定范围区间内的行
range between <start_expr> and <end_expr>
该范围值/区间必须是数字或日期类型
目前只支持一个order by列
select *,count(num) over(partition by name order by num range between 20 preceding and 10 following) rst from test;