数据库-玩转数据-Mysql8窗口函数

一、说明

窗口函数,也叫分析函数,这个功能在大多数据库中早已支持,Mysql8.0也对窗口函数的支持。

窗口的概念 非常重要,它可以理解为记录集合,窗口函数也就是在满足某种条件的记录集合上执行的特殊函数。对于每条记录都要在此窗口内执行函数,有的函数随着记录不同,窗口大小都是固定的,这种属于静态窗口;有的函数则相反,不同的记录对应着不同的窗口,这种动态变化的窗口叫滑动窗口

简单的说窗口函数就是对于查询的每一行,都使用与该行相关的行进行计算。

窗口函数和普通聚合函数很容易混淆,二者区别如下:

聚合函数是将多条记录聚合为一条;而窗口函数是每条记录都会执行,有几条记录执行完还是几条。
聚合函数也可以用于窗口函数中。

二、窗口函数功能

名称描述
CUME_DIST()计算一组值中一个值的累积分布
DENSE_RANK()根据该ORDER BY子句为分区中的每一行分配一个等级。它将相同的等级分配给具有相等值的行。如果两行或更多行具有相同的排名,则排名值序列中将没有间隙
FIRST_VALUE()返回相对于窗口框架第一行的指定表达式的值
LAG()返回分区中当前行之前的第N行的值。如果不存在前一行,则返回NULL
LAST_VALUE()返回相对于窗口框架中最后一行的指定表达式的值
LEAD()返回分区中当前行之后的第N行的值。如果不存在后续行,则返回NULL
NTH_VALUE()从窗口框架的第N行返回参数的值
NTILE()将每个窗口分区的行分配到指定数量的排名组中
PERCENT_RANK()计算分区或结果集中行的百分数等级
RANK()与DENSE_RANK()函数相似,不同之处在于当两行或更多行具有相同的等级时,等级值序列中存在间隙
ROW_NUMBER()为分区中的每一行分配一个顺序整数将上述函数按照功能划分,可以把MySQL支持的

将以上窗口函数分为如下几类:

在这里插入图片描述

三、窗口函数语法

窗口函数的相关语法是:

[WINDOW window_name AS (window_spec)
    [, window_name AS (window_spec)] ...]

window_function_name(window_name/expression) 
    OVER (
        [partition_defintion]
        [order_definition]
        [frame_definition]
    )

四、窗口函数示例

MySQL [test]> select * from t1;
+----+------+--------+---------------------+
| id | name | amount | time                |
+----+------+--------+---------------------+
|  1 | a1   |    100 | 2019-01-01 00:00:00 |
|  2 | a1   |    200 | 2019-01-02 00:00:00 |
|  3 | a1   |    300 | 2019-01-02 00:00:00 |
|  4 | a1   |    300 | 2019-01-03 00:00:00 |
|  5 | a1   |    300 | 2019-01-04 00:00:00 |
|  6 | a2   |    500 | 2019-01-05 00:00:00 |
|  7 | a2   |    600 | 2019-01-06 00:00:00 |
|  8 | a2   |    600 | 2019-01-07 00:00:00 |
|  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 10 | a2   |    600 | 2019-01-07 00:00:00 |
+----+------+--------+---------------------+
10 rows in set (0.00 sec)
MySQL [test]> select id,name,amount,time,avg(amount) over w as avg_sum from t1 window w as (partition by name order by time desc rows BETWEEN 1 PRECEDING AND 1 FOLLOWING);
+----+------+--------+---------------------+----------+
| id | name | amount | time                | avg_sum  |
+----+------+--------+---------------------+----------+
|  5 | a1   |    300 | 2019-01-04 00:00:00 | 300.0000 |
|  4 | a1   |    300 | 2019-01-03 00:00:00 | 266.6667 |
|  2 | a1   |    200 | 2019-01-02 00:00:00 | 266.6667 |
|  3 | a1   |    300 | 2019-01-02 00:00:00 | 200.0000 |
|  1 | a1   |    100 | 2019-01-01 00:00:00 | 200.0000 |
|  8 | a2   |    600 | 2019-01-07 00:00:00 | 600.0000 |
|  9 | a2   |    600 | 2019-01-07 00:00:00 | 600.0000 |
| 10 | a2   |    600 | 2019-01-07 00:00:00 | 600.0000 |
|  7 | a2   |    600 | 2019-01-06 00:00:00 | 566.6667 |
|  6 | a2   |    500 | 2019-01-05 00:00:00 | 550.0000 |
+----+------+--------+---------------------+----------+
10 rows in set (0.00 sec)

该案例是以name为分组,时间进行降序排序,以分组内当前记录的前一条和后一条为函数计算的范围进行求平均。

从结果可以看出,id为5的记录属于边界值,没有前一行,因此avg_sum为(300+300)/2=300;id为4的记录前后都有记录,所以avg_sum为(300+300+200)/3=266.6667,以此类推就可以得到一个基于滑动窗口的动态平均值。此例中,窗口函数用到了传统的聚合函数avg(),用来计算动态的平均值。

MySQL [test]> select CUME_DIST() over (partition by name order by time desc ) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|     0.2 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|     0.4 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|     0.8 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|     0.8 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       1 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|     0.6 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|     0.6 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|     0.6 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|     0.8 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       1 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

根据name进行分组,根据时间进行降序排列,统计当前时间(天)的行数与分组内总行数的占比。

MySQL [test]> select DENSE_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       2 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       3 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       3 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       4 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       2 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       3 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.01 sec)

以name分块,时间进行降序排序,进行序号编号,dense_rank()的出现是为了解决rank()编号存在的问题的,rank()编号的时候存在跳号的问题,如果有两个并列第1,那么下一个名次的编号就是3,结果就是没有编号为2的数据。如果不想跳号,可以使用dense_rank()替代。

MySQL [test]> select FIRST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-04 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

以name分块,时间降序排列,获取最大时间为新生成的列的值。
first_value就是取某一组数据,按照某种方式排序的,最早的一个字段的值。

MySQL [test]> select LAG(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| NULL                |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| NULL                |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-06 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

以name分块,时间降序排列,当前行的上一行的时间作为新属性的值。

lag(column,n)获取当前数据行按照某种排序规则的上n行数据的某个字段。

MySQL [test]> select LAST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-05 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

last_value就是取某一组数据,按照某种方式排序的,最新的一个字段的值。

MySQL [test]> select LEAD(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| 2019-01-03 00:00:00 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-02 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| NULL                |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-05 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| NULL                |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

lead(column,n)获取当前数据行按照某种排序规则的下n行数据的某个字段。

MySQL [test]> select NTH_VALUE(time,2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| NULL                |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

从排序的第n行还是返回nth_value字段中的值。

MySQL [test]> select NTILE(2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       1 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       1 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       2 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       2 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       2 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       2 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

按照某列的倒序排列,将字段分成N组,可以得到哪个数据在N组中哪一部分。

MySQL [test]> select PERCENT_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       0 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|    0.25 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|     0.5 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|     0.5 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       1 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       0 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       0 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       0 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|    0.75 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       1 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

数据分布的计算方式:当前RANK值-1/总行数-1

MySQL [test]> select RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       2 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       3 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       3 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       5 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       4 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       5 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

排序条件一样的情况下,其编号也一样。

MySQL [test]> select ROW_NUMBER() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       2 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       3 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       4 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       5 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       2 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       3 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       4 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       5 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)

对排序结果编号。

五、总结

以前觉得Oracle窗口函数方便,Mysqly要实现相同功能比较麻烦,但MySQL8.0中加入了窗口函数的功能,这一点方便了SQL的编写,是MySQL8.0的亮点之一。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值