窗口函数的基本语法如下:
<窗口函数> over (partition by <用于分组的列名>
order by <用于排序的列名>)
<窗口函数>的位置,可以放以下两种函数:
- 专用窗口函数
- 序号函数:row_number() / rank() / dense_rank()
- 分布函数:percent_rank() / cume_dist()
- 前后函数:lag() / lead()
- 头尾函数:first_val() / last_val()
- 其他函数:nth_value() (用途:返回窗口中第N个expr的值,expr可以是表达式,也可以是列名)/ntile()(用途:将分区中的有序数据分为n个桶,记录桶号
- 聚合函数,如sum(), avg(), count(), max(), min()等
注意事项
- 窗口函数对where和group by子句处理后的结果进行操作,所以原则上窗口函数只能写在select子句中
- 聚合函数是将多条记录聚合为一条;而窗口函数是每条记录都会执行,查询结果并不会改变记录条数,有几条记录执行完还是几条。
- 普通聚合函数也可以用于窗口函数中,赋予它窗口函数的功能。
- 窗口函数的执行顺序(逻辑上的)是在FROM,JOIN,WHERE,GROUP BY,HAVING之后,在ORDER BY,LIMIT,SELECT,DISTINCT之前。它执行时GROUP BY的聚合过程已经完成了,所以不会再产生数据聚合。
建表
create table student (sid char(2), sname char(5), sclass char(2));
create table course (cid char(2), cname char(10));
create table score (sid char(2), cid char(2), score int);
insert into student values('01', '崔健', '01');
insert into student values('02', '李健', '01');
insert into student values('03', '高虎', '01');
insert into student values('04', '子健', '01');
insert into student values('05', '石璐', '01');
insert into student values('06', '亚千', '01');
insert into student values('07', '史立', '01');
insert into student values('08', '窦唯', '01');
insert into student values('09', '华东', '01');
insert into course values('01', '金属');
insert into course values('02', '迷幻');
insert into course values('03', '朋克');
insert into course values('04', '后摇');
insert into score values('01', '01', 60);
insert into score values('02', '01', 85);
insert into score values('03', '01', 57);
insert into score values('04', '01', 34);
insert into score values('05', '01', 78);
insert into score values('06', '01', 90);
insert into score values('07', '01', 76);
insert into score values('08', '01', 90);
insert into score values('09', '01', 85);
insert into score values('01', '02', 78);
insert into score values('02', '02', 59);
insert into score values('03', '02', 59);
insert into score values('04', '02', 79);
insert into score values('05', '02', 88);
insert into score values('01', '03', 65);
insert into score values('03', '03', 89);
insert into score values('05', '03', 46);
insert into score values('06', '03', 85);
insert into score values('07', '03', 89);
insert into score values('08', '03', 79);
insert into score values('03', '04', 99);
insert into score values('04', '04', 95);
insert into score values('07', '04', 68);
insert into score values('08', '04', 59);
insert into score values('09', '04', 80);
1.专用窗口函数
1.1序号函数
- row_number() 、 rank() 、 dense_rank()都是序号函数,一个例子说明三者的区别,对每门课程的成绩排序:
SELECT s.sname, c.cname, sc.score,
ROW_NUMBER() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) AS row_num,
RANK() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) AS ranking,
DENSE_RANK() OVER(PARTITION BY c.cname
ORDER BY sc.score DESC) AS dense_ranking
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
- row_number 相同成绩不会并列,按出现顺序排名
- rank 相同成绩会并列,且下一名为并列排名+并列人数
- dense_rank 相同成绩会并列,并且下一名为并列排名+1
1.2分布函数
- percent_rank()
用途:和之前的RANK()函数相关,每行按照如下公式进行计算:
(rank - 1) / (rows - 1)
其中,rank为RANK()函数产生的序号,rows为当前窗口的记录总行数。
SELECT s.sname, c.cname, sc.score, RANK() OVER(PARTITION BY c.cname ORDER BY sc.score DESC) as ranking,
PERCENT_RANK() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) as percent
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
- cume_dist()
用途:分组内大于等于当前rank值的行数/分组内总行数,这个函数比percen_rank使用场景更多。
应用场景:某门课程各同学排名在前百分之几
SELECT s.sname, c.cname, sc.score, RANK() OVER(PARTITION BY c.cname ORDER BY sc.score DESC) as ranking,
CUME_DIST() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) as cumdist
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid;
亚千和窦唯的金属成绩并列第一,在班级排名前22.22%
1.3前后函数
lag和lead函数可以在同一次查询中取出同一字段的前N行数据(lag)和后N行数据(lead)
语法:
LAG(EXP_STR,OFFSET,DEFVAL)OVER()
LEAD(EXP_STR,OFFSET,DEFVAL)OVER()
EXP_STR
:要取的列
OFFSET
: 取偏移后的第几行数据
DEFVAL
:无偏移值的取值,默认为NULL
应用场景:求每个用户相邻两次浏览的时间差;求每个同学相邻两门考试的成绩差
SELECT s.sname, c.cname, sc.score,
lead(sc.score,1) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as leadVal,
lag(sc.score,1) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as lagVal,
score - leadVal as diff1,
score - lagVal as diff2
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid;
1.4头尾函数
- first_val()/last_val()
用途:得到分区中的第一个/最后一个指定参数的值。
SELECT s.sname, c.cname, sc.score,
FIRST_VALUE(sc.score) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as firstVal,
LAST_VALUE(sc.score) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as lastVal
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
1.5其他函数
- nth_value()
SELECT s.sname, s.sclass, c.cname, sc.score,
nth_value(sc.score,1) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as 1th,
nth_value(sc.score,2) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as 2th
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
- ntile()
- 用途:将分区中的有序数据分为n个桶,记录桶号。
- 此函数在数据分析中应用较多,比如由于数据量大,需要将数据平均分配到N个并行的进程分别计算,此时就可以用NFILE(N)对数据进行分组,由于记录数不一定被N整除,所以数据不一定完全平均,多出来的部分则依次加给第一组、第二组···直到分配完。
SELECT
sid, cid, score, ntile(4) over(partition by sid order by score desc) as tile
DROM
score;
2.聚合函数
聚合函数用作窗口函数分为两种情况
2.1 不指定分组
SELECT
sid, cid, score,
sum(score) over (order by sid) as current_sum,
avg(score) over (order by sid) as current_avg,
count(score) over (order by sid) as current_count,
max(score) over (order by sid) as current_max,
min(score) over (order by sid) as current_min
FROM
score
- 结果解释
聚合函数sum在窗口函数中,是对自身记录、及位于自身记录以上的数据进行求和的结果。当排序字段有相同值时,会对这些相同值执行一次聚合函数。
如按照sid排序,sid为01的记录有三条,则对这三条记录执行聚合函数,并在此基础上再进行接下来的累积(相当于把这三条重复记录执行一次聚合函数,然后看成一行)
排序字段没有重复的情况:
select sid, cid, score,
sum(score) over (order by sid,cid) as current_sum,
avg(score) over (order by sid,cid) as current_avg,
count(score) over (order by sid,cid) as current_count,
max(score) over (order by sid,cid) as current_max,
min(score) over (order by sid,cid) as current_min
from score
- 进行的是累积操作,在每一个新的行都对包括此行在内的之前所有行执行一次聚合函数。
2.2 指定分组
select cid, sid, score,
sum(score) over (partition by cid order by sid) as current_sum,
avg(score) over (partition by cid order by sid) as current_avg,
count(score) over (partition by cid order by sid) as current_count,
max(score) over (partition by cid order by sid) as current_max,
min(score) over (partition by cid order by sid) as current_min
from score
体现了窗口
根据cid分窗,在每个cid中执行2.1中不分窗的操作
2.3 滑动窗口
1、unbounded preceding
:从当前分区的第一行开始,到当前行结束。
2、current row
:从当前行开始,也结束于当前行。
3、[numeric expression] preceding
:对于rows来说从当前行之前的第[numeric expression]行开始(共[numeric expression]+1行),到当前行结束。对range来说从小于数值表达式的值开始,到当前行结束。
4、[numeric expression] following
:与[numeric expression] preceding相反。