一、说明
窗口函数,也叫分析函数,这个功能在大多数据库中早已支持,Mysql8.0也对窗口函数的支持。
窗口的概念 非常重要,它可以理解为记录集合,窗口函数也就是在满足某种条件的记录集合上执行的特殊函数。对于每条记录都要在此窗口内执行函数,有的函数随着记录不同,窗口大小都是固定的,这种属于静态窗口;有的函数则相反,不同的记录对应着不同的窗口,这种动态变化的窗口叫滑动窗口。
简单的说窗口函数就是对于查询的每一行,都使用与该行相关的行进行计算。
窗口函数和普通聚合函数很容易混淆,二者区别如下:
聚合函数是将多条记录聚合为一条;而窗口函数是每条记录都会执行,有几条记录执行完还是几条。
聚合函数也可以用于窗口函数中。
二、窗口函数功能
名称 | 描述 |
---|---|
CUME_DIST() | 计算一组值中一个值的累积分布 |
DENSE_RANK() | 根据该ORDER BY子句为分区中的每一行分配一个等级。它将相同的等级分配给具有相等值的行。如果两行或更多行具有相同的排名,则排名值序列中将没有间隙 |
FIRST_VALUE() | 返回相对于窗口框架第一行的指定表达式的值 |
LAG() | 返回分区中当前行之前的第N行的值。如果不存在前一行,则返回NULL |
LAST_VALUE() | 返回相对于窗口框架中最后一行的指定表达式的值 |
LEAD() | 返回分区中当前行之后的第N行的值。如果不存在后续行,则返回NULL |
NTH_VALUE() | 从窗口框架的第N行返回参数的值 |
NTILE() | 将每个窗口分区的行分配到指定数量的排名组中 |
PERCENT_RANK() | 计算分区或结果集中行的百分数等级 |
RANK() | 与DENSE_RANK()函数相似,不同之处在于当两行或更多行具有相同的等级时,等级值序列中存在间隙 |
ROW_NUMBER() | 为分区中的每一行分配一个顺序整数将上述函数按照功能划分,可以把MySQL支持的 |
将以上窗口函数分为如下几类:
三、窗口函数语法
窗口函数的相关语法是:
[WINDOW window_name AS (window_spec)
[, window_name AS (window_spec)] ...]
window_function_name(window_name/expression)
OVER (
[partition_defintion]
[order_definition]
[frame_definition]
)
四、窗口函数示例
MySQL [test]> select * from t1;
+----+------+--------+---------------------+
| id | name | amount | time |
+----+------+--------+---------------------+
| 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 6 | a2 | 500 | 2019-01-05 00:00:00 |
| 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 10 | a2 | 600 | 2019-01-07 00:00:00 |
+----+------+--------+---------------------+
10 rows in set (0.00 sec)
MySQL [test]> select id,name,amount,time,avg(amount) over w as avg_sum from t1 window w as (partition by name order by time desc rows BETWEEN 1 PRECEDING AND 1 FOLLOWING);
+----+------+--------+---------------------+----------+
| id | name | amount | time | avg_sum |
+----+------+--------+---------------------+----------+
| 5 | a1 | 300 | 2019-01-04 00:00:00 | 300.0000 |
| 4 | a1 | 300 | 2019-01-03 00:00:00 | 266.6667 |
| 2 | a1 | 200 | 2019-01-02 00:00:00 | 266.6667 |
| 3 | a1 | 300 | 2019-01-02 00:00:00 | 200.0000 |
| 1 | a1 | 100 | 2019-01-01 00:00:00 | 200.0000 |
| 8 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 9 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 10 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 7 | a2 | 600 | 2019-01-06 00:00:00 | 566.6667 |
| 6 | a2 | 500 | 2019-01-05 00:00:00 | 550.0000 |
+----+------+--------+---------------------+----------+
10 rows in set (0.00 sec)
该案例是以name为分组,时间进行降序排序,以分组内当前记录的前一条和后一条为函数计算的范围进行求平均。
从结果可以看出,id为5的记录属于边界值,没有前一行,因此avg_sum为(300+300)/2=300;id为4的记录前后都有记录,所以avg_sum为(300+300+200)/3=266.6667,以此类推就可以得到一个基于滑动窗口的动态平均值。此例中,窗口函数用到了传统的聚合函数avg(),用来计算动态的平均值。
MySQL [test]> select CUME_DIST() over (partition by name order by time desc ) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 0.2 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 0.4 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 0.8 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 0.8 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 1 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 0.6 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.6 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.6 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.8 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 1 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
根据name进行分组,根据时间进行降序排列,统计当前时间(天)的行数与分组内总行数的占比。
MySQL [test]> select DENSE_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 4 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 3 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.01 sec)
以name分块,时间进行降序排序,进行序号编号,dense_rank()的出现是为了解决rank()编号存在的问题的,rank()编号的时候存在跳号的问题,如果有两个并列第1,那么下一个名次的编号就是3,结果就是没有编号为2的数据。如果不想跳号,可以使用dense_rank()替代。
MySQL [test]> select FIRST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-04 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
以name分块,时间降序排列,获取最大时间为新生成的列的值。
first_value就是取某一组数据,按照某种方式排序的,最早的一个字段的值。
MySQL [test]> select LAG(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| NULL | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| NULL | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-06 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
以name分块,时间降序排列,当前行的上一行的时间作为新属性的值。
lag(column,n)获取当前数据行按照某种排序规则的上n行数据的某个字段。
MySQL [test]> select LAST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-05 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
last_value就是取某一组数据,按照某种方式排序的,最新的一个字段的值。
MySQL [test]> select LEAD(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-03 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-02 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| NULL | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-05 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| NULL | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
lead(column,n)获取当前数据行按照某种排序规则的下n行数据的某个字段。
MySQL [test]> select NTH_VALUE(time,2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| NULL | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
从排序的第n行还是返回nth_value字段中的值。
MySQL [test]> select NTILE(2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 1 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 1 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
按照某列的倒序排列,将字段分成N组,可以得到哪个数据在N组中哪一部分。
MySQL [test]> select PERCENT_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 0 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 0.25 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 0.5 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 0.5 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 1 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 0 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 0 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 0 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.75 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 1 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
数据分布的计算方式:当前RANK值-1/总行数-1
MySQL [test]> select RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 5 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 4 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 5 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
排序条件一样的情况下,其编号也一样。
MySQL [test]> select ROW_NUMBER() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 4 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 5 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 3 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 4 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 5 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
对排序结果编号。
五、总结
以前觉得Oracle窗口函数方便,Mysqly要实现相同功能比较麻烦,但MySQL8.0中加入了窗口函数的功能,这一点方便了SQL的编写,是MySQL8.0的亮点之一。