一、概述
- 窗口函数是一组特殊函数
- 扫描多个输入行来计算每个输出值,为每行数据生成一行结果
- 可以通过窗口函数来实现复杂的计算和聚合
- 语法
function (arg1,..., arg n) over ([partition by <...>] [order by <....>] [<window_clause>])
- partition by类似于group by,未指定则按整个结果集
- 只有指定order by子句之后才能进行窗口定义
- 可同时使用多个窗口函数
- 过滤窗口函数计算结果必须在外面一层
- 按功能可划分为:排序,聚合,分析
案例:
#得到男性个数,女性个数
select name,count(id) over(partition by info.sex) from emp_id;
create database test;
use test;
create table test(name string,date string,num int)
row format delimited
fields terminated by ',';
select *,count(*) over() from test;
select *,count(*) over(partition by name) from test;
//组内会按照顺序count
select *,count(*) over(partition by name order by num) from test;
//过滤计算,外面套一层子查询
select * from(select *,count(*) over(partition by name order by date desc) win1 from test) a where a.win1<3;
select *,count(1) over() from test where date like '2018-01-%';
二、排序
- ROW_NUMBER()
对所有数值输出不同的序号,序号唯一连续
select *,row_number() over(partition by name) from test;
- RANK()
对相同数值,输出相同的序号,下一个序号跳过(1,1,3)
select *, rank() over() win1 from test;
select *, rank() over(partition by name) win1 from test;
select *, rank() over(partition by name order by date) win1 from test;
- DENSE_RANK()
对相同数值,输出相同的序号,下一个序号连续(1,1,2)
//不会跳过序号
select *,dense_rank() over(partition by name order by date) win1 from test;
- NTILE(n)
将有序的数据集合平均分配到n个桶中, 将桶号分配给每一行,根据桶号,选取前或后 n分之几的数据
select *,ntile(4) over(partition by name) from test;
- LAG和LEAD
lag返回当前数据行的上一行数据
lead返回当前数据行的下一行数据
//查询顾客上次购买的时间
select *,lag(date) over(partition by name order by date) from test;
- first_value和last_value
select *,lag(date) over(partition by name order by date) lastdate,lag(num) over(partition by name order by date),first_value(num) over(partition by name order by date) from test;
- PERCENT_RANK()
(目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名
三、窗口定义
-
窗口定义由[<window_clause>]子句描述
用于进一步细分结果并应用分析函数 -
支持两类窗口定义
行类型窗口
范围类型窗口 -
rank、ntile、dense_rank、cume_dist、percent_rank、lead、lag和row_number函数不支持与窗口子句一起使用
-
行窗口:根据当前行之前或之后的行号确定的窗口
rows between <start_expr> and <end_expr> -
<start_expr>可以为下列值
unbounded preceding: 窗口起始位置(分组第一行)
current row:当前行
n preceding/following:当前行之前/之后n行 -
<end_expr>可以为下列值
unbounded following : 窗口结束位置(分组最后一行)
current row:当前行
n preceding/following:当前行之前/之后n行
select *,count(1) over(partition by name order by buydate desc rows between unbounded preceding and current row ) from test;
select *,count(1) over(partition by name order by buydate desc rows between 1 preceding and current row) from test;
select *,count(1) over(partition by name order by buydate desc rows between 1 preceding and 1 following) from test;
- 范围窗口是取分组内的值在指定范围区间内的行
该范围值/区间必须是数字或日期类型
目前只支持一个order by列
select *,count(num) over(partition by name order by num range between 20 preceding and 10 following) rst from test;