聚合函数是将多条记录聚合为一条,而窗口函数是每条记录都会执行
win_fn()over (partition by order by frame) as new_row,
partition:分区,窗口函数在不同分组分别执行
order:排序
frame:在分区滑动窗口,可使用一般聚合函数比如avg等
固定窗口函数
Name | Description |
---|---|
CUME_DIST() over_clause | Cumulative distribution value |
DENSE_RANK() over_clause | Rank of current row within its partition, without gaps |
FIRST_VALUE(expr) [null_treatment] over_clause | Value of argument from first row of window frame |
LAG(expr [, N[, default]]) [null_treatment] over_clause | Value of argument from row lagging current row within partition |
LAST_VALUE(expr) [null_treatment] over_clause | Value of argument from last row of window frame |
LEAD(expr [, N[, default]]) [null_treatment] over_clause | Value of argument from row leading current row within partition |
NTH_VALUE(expr, N) [from_first_last] [null_treatment] over_clause | Value of argument from N-th row of window frame |
NTILE(N) over_clause | Bucket number of current row within its partition. |
PERCENT_RANK() over_clause | Percentage rank value |
RANK() over_clause | Rank of current row within its partition, with gaps |
ROW_NUMBER() over_clause | Number of current row within its partition |
创建测试数据框
create table ss.test(
id INT,
c1 INT,
c2 INT
)
insert into ss.test (c1,c2) values(2,6);
insert into ss.test (c1,c2) values(1,3);
insert into ss.test (c1,c2) values(1,2);
insert into ss.test (c1,c2) values(1,1);
insert into ss.test (c1,c2) values(2,4);
insert into ss.test (c1,c2) values(2,4);
c2 | c1 |
---|---|
6 | 2 |
3 | 1 |
2 | 1 |
1 | 1 |
4 | 2 |
4 | 2 |
测试
序号函数——row_number() / rank() / dense_rank()
比较区别
row_number不对重复值处理
rank()重复值会并列,后面的值按整体的序号
dense_rank()重复值会并列,后面的值不按整体的序号
select c1,c2,row_number() over (partition by c1 order by c2 asc) as 'new' from ss.test; # 按c1分组,c2排序,给出每行在分组中的序号
c1 | c2 | new |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 3 | 3 |
2 | 4 | 1 |
2 | 4 | 2 |
2 | 6 | 3 |
select c1,c2,rank() over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,求排名
c1 | c2 | new |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 3 | 3 |
2 | 4 | 1 |
2 | 4 | 1 |
2 | 6 | 3 |
select c1,c2,dense_rank() over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,有重复的排名只计算一次
c1 | c2 | new |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 3 | 3 |
2 | 4 | 1 |
2 | 4 | 1 |
2 | 6 | 2 |
应用分组topn
select * from (select c1,c2,row_number() over (partition by c1 order by c2 asc) as 'new' from ss.test) as temp where new <= 2 ;
c1 | c2 | new |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
2 | 4 | 1 |
2 | 4 | 2 |
分布函数——percent_rank()/cume_dist()。
percent_rank() : (rank - 1) / (rows - 1),rank为RANK()函数产生的序号,rows为当前窗口的记录总行数
cume_dist():分组内小于等于当前rank值的行数/分组内总行数
select c1,c2,percent_rank() over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,求排名百分比
c1 | c2 | new |
---|---|---|
1 | 1 | 0 |
1 | 2 | 0.5 |
1 | 3 | 1 |
2 | 4 | 0 |
2 | 4 | 0 |
2 | 6 | 1 |
select c1,c2,cume_dist() over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,小于等于当前rank行的百分比
c1 | c2 | new |
---|---|---|
1 | 1 | 0.3333333333333333 |
1 | 2 | 0.6666666666666666 |
1 | 3 | 1 |
2 | 4 | 0.6666666666666666 |
2 | 4 | 0.6666666666666666 |
2 | 6 | 1 |
应用
查看分区后,小于当前值占的百分比,比如小于某个人考试成绩的百分比
前后函数——lead(n)/lag(n)
lead(n):都是错行取值
lag(n):都是错行取值
select c1,c2,lead(c2) over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,错一行取值
c1 | c2 | new |
---|---|---|
1 | 1 | 2 |
1 | 2 | 3 |
1 | 3 | |
2 | 4 | 4 |
2 | 4 | 6 |
2 | 6 |
select c1,c2,lag(c2) over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,错一行取值
c1 | c2 | new |
---|---|---|
1 | 1 | |
1 | 2 | 1 |
1 | 3 | 2 |
2 | 4 | |
2 | 4 | 4 |
2 | 6 | 4 |
应用
按时间排序后,查询两笔订单之间的时间差
取记录函数——first_val(expr)/last_val(expr)/nth_value(expr)
select c1,c2,first_value(c2) over (partition by c1 order by c2 asc) as 'new' from ss.test;# c1分组,c2排序,取每组c2第一个值
c1 | c2 | new |
---|---|---|
1 | 1 | 1 |
1 | 2 | 1 |
1 | 3 | 1 |
2 | 4 | 4 |
2 | 4 | 4 |
2 | 6 | 4 |
select c1,c2,last_value(c2) over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,取每组c2最后一个值
c1 | c2 | new |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 3 | 3 |
2 | 4 | 4 |
2 | 4 | 4 |
2 | 6 | 6 |
select c1,c2,nth_value(c2,2) over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,取每组c2第二个值
c1 | c2 | new |
---|---|---|
1 | 1 | |
1 | 2 | 2 |
1 | 3 | 2 |
2 | 4 | 4 |
2 | 4 | 4 |
2 | 6 | 4 |
应用
取分区后指定位置的值,比如取每个班第二名的成绩
顺序分组
select c1,c2,ntile(2) over (partition by c1 order by c2 asc) as 'new' from ss.test; # c1分组,c2排序,按排序进行分组
c1 | c2 | new |
---|---|---|
1 | 1 | 1 |
1 | 2 | 1 |
1 | 3 | 2 |
2 | 4 | 1 |
2 | 4 | 1 |
2 | 6 | 2 |
滑动frame
frame的语法是BETWEEN frame_start AND frame_end
WINDOW w AS (partition by c1 order by c2 asc ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
生成一个frame在partitioner
分区里滑动
rows BETWEEN 1 PRECEDING AND 1 FOLLOWING
当前行的前一行和后一行,共三行记录
rows UNBOUNDED FOLLOWING
当前行到分区后的最后一行
INTERVAL 7 DAY PRECEDING
一周以内的
窗口命名
1.窗口名称
使用WINDOW w AS
将替换over后面的语句,让语句更加清晰
select c1,
c2,
cume_dist() over w as 'new'
from ss.test
WINDOW w AS (partition by c1 order by c2 asc);
参考:
https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html
https://www.cnblogs.com/DataArt/p/9961676.html