Mysql -窗口函数详解

数据分析|MySQL窗口函数详解与案例实战

1. 窗口函数定义与分类

2. 各类别窗口函数用法详解

2.1 序号函数:ROW_NUMBER()、RANK()、DENSE_RANK()

2.2 分布函数:PERCENT_RANK()、CUME_DIST()

2.3 前后函数:LAG(expr,n)、LEAD(expr,n)

2.4 头尾函数:FIRST_VALUE(expr)、LAST_VALUE(expr)

2.5 其它函数:NTH_VALUE(expr, n)、NTILE(n)

2.6 聚合函数作为窗口函数:SUM()、AVG()、MAX()、MIN()、COUNT()

2.7 Frame子句

3. 窗口函数实战

3.1 基础数据准备

3.2 窗口函数用法实战

4. 相关业务问题

4.1 解决方法

4.2 TOPN问题模板

4.3 聚合函数作为窗口函数

5. 总结


 

 

数据分析|MySQL窗口函数详解与案例实战

 

1. 窗口函数定义与分类

(1)定义

窗口函数也称OLAP函数(Online Anallytical Processing,联机分析处理),可对数据进行实时分析处理。

窗口表示记录集合;窗口函数表示在满足某些条件的记录集合上执行的特殊函数,对于每条记录都要在此窗口内执行函数;有的函数随着记录的不同,窗口大小都是固定的,称为静态窗口;有的函数则相反,不同的记录对应着不同的窗口,称为滑动窗口

 

(2)窗口函数与普通函数的区别与联系

①聚合函数是将多条记录聚合为一条;窗口函数是每条记录都会执行,有几条记录执行完还是几条。

②聚合函数也可以用于窗口函数。

 

(3)基本用法

<窗口函数> over ([partition by <列清单>]
                        order by <排序用列清单>)

over关键字用来指定函数执行的窗口范围,若后面括号什么都不写,则意味着窗口包含满足WHERE条件的所有行,窗口函数基于所有行进行计算;如不为空,则支持以下四种语法来设置窗口。

①window_name:给窗口指定一个别名。如果SQL中涉及的窗口较多,采用别名可以看起来更清晰易读;

②PARTITION BY 子句:窗口按照哪些字段进行分组,窗口函数在不同的分组上分别执行;

③ORDER BY子句:按照哪些字段进行排序,窗口函数将按照排序后的记录顺序进行编号;

④FRAME子句:FRAME是当前分区的一个子集,子句用来定义子集的规则,通常用来作为滑动窗口使用。
 

(4)窗口函数分类                             

   图一:常用窗口函数汇总
图一:常用窗口函数汇总

 

2. 各类别窗口函数用法详解

2.1 序号函数:ROW_NUMBER()、RANK()DENSE_RANK()

① ROW_NUMBER()

  • 用途:显示分区中的当前行号
  • 应用场景:查询每个学生分数最高的3门课程
SELECT * FROM (
     SELECT stu_id,
     ROW_NUMBER() OVER (PARTITION BY stu_id ORDER BY score DESC) AS score_order,
     lesson_id, score
     FROM t_score) t
 WHERE score_order <= 3;

+--------+-------------+-----------+-------+
| stu_id | score_order | lesson_id | score |
+--------+-------------+-----------+-------+
|      1 |           1 | L005      |    98 |
|      1 |           2 | L001      |    98 |
|      1 |           3 | L004      |    88 |
|      2 |           1 | L002      |    90 |
|      2 |           2 | L003      |    86 |
|      2 |           3 | L001      |    84 |
|      3 |           1 | L001      |   100 |
|      3 |           2 | L002      |    91 |
|      3 |           3 | L003      |    85 |
|      4 |           1 | L001      |    99 |
|      4 |           2 | L005      |    98 |
|      4 |           3 | L002      |    88 |
+--------+-------------+-----------+-------+
  • 解释:row_number()是排序函数,partition by将学号分类,相当于group by子句功能,但group by子句分类汇总会改变原数据行数,而窗口函数保持原行数,order by将成绩降序排列。对于stu_id=1的同学,有两门课程的成绩均为98,序号随机排了1和2。但很多情况下二者应该是并列第一,则他的成绩为88的这门课的序号可能是第2名,也可能为第3名。这时ROW_NUMBER()就不能满足需求,需要使用RANK()和DENSE_RANK(),它们和ROW_NUMBER()非常类似,只是在出现重复值时处理逻辑有所不同。
  • 注意:窗口函数是对where或group by处理后结果进行操作,因此按SQL语句运行顺序,窗口函数一般放在select子句中,不能在where子句中使用窗口函数别名。插播“Mysql语句执行顺序

② RANK()DENSE_RANK()

rank()/dense_rank():与ROW_NUMBER()非常类似,只是在出现重复值时处理逻辑有所不同。下面我们查看学号stu_id=1使用不同序号函数的排序结果。

SELECT *
FROM (
         SELECT ROW_NUMBER() OVER (PARTITION BY stu_id ORDER BY score DESC) AS score_row_number,
                RANK() OVER (PARTITION BY stu_id ORDER BY score DESC)       AS score_rank,
                DENSE_RANK() OVER (PARTITION BY stu_id ORDER BY score DESC) AS score_dense_rank,
                stu_id,
                lesson_lid,
                score
         FROM t_score) t
WHERE stu_id = 1
  AND score_order1 <= 3
  AND score_order2 <= 3
  AND score_order3 <= 3
;

+------------------+------------+------------------+--------+-----------+-------+
| score_row_number | score_rank | score_dense_rank | stu_id | lesson_id | score |
+------------------+------------+------------------+--------+-----------+-------+
|                1 |          1 |                1 |      1 | L005      |    98 |
|                2 |          1 |                1 |      1 | L001      |    98 |
|                3 |          3 |                2 |      1 | L004      |    88 |
+------------------+------------+------------------+--------+-----------+-------+

当成绩相同时,会存在并列情况,主要区别是三个函数如何处理并列情况:

  • 在rank()函数,如果有并列情况,会占用下一个名次,如:1,1,3;
  • 在dense_rank()函数中,如果有并列的情况,不会占用下一个名次,如:1,1,2;
  • 在row_number()函数中,会忽略并列的情况,依次排序,名次是:1,2,3;

 

2.2 分布函数:PERCENT_RANK()CUME_DIST()

① PERCENT_RANK()

  • 用途:每行按照公式(rank-1) / (rows-1)进行计算。其中,rankRANK()函数产生的序号,rows为当前窗口的记录总行数
  • 应用场景:不常用
SELECT RANK() OVER w         AS rk,
       PERCENT_RANK() OVER w AS prk,
       stu_id,
       lesson_id,
       score
FROM t_score
WHERE stu_id = 1
    WINDOW w AS (PARTITION BY stu_id ORDER BY score)
;

+----+------+--------+-----------+-------+
| rk | prk  | stu_id | lesson_id | score |
+----+------+--------+-----------+-------+
|  1 |    0 |      1 | L003      |    79 |
|  2 | 0.25 |      1 | L002      |    86 |
|  3 |  0.5 |      1 | L004      |    88 |
|  4 | 0.75 |      1 | L005      |    98 |
|  4 | 0.75 |      1 | L001      |    98 |
+----+------+--------+-----------+-------+

② CUME_DIST()

  • 用途:分组内小于、等于当前rank值的行数 / 分组内总行数
  • 应用场景:查询小于等于当前成绩(score)的比例
SELECT stu_id,
       lesson_id,
       score,
       CUME_DIST() OVER (ORDER BY score)                        AS cd1,
       CUME_DIST() OVER (PARTITION BY lesson_id ORDER BY score) AS cd2
FROM t_score
WHERE lesson_id IN ('L001', 'L002')
;

+--------+-----------+-------+-------+------+
| stu_id | lesson_id | score | cd1   | cd2  |
+--------+-----------+-------+-------+------+
|      2 | L001      |    84 | 0.125 | 0.25 |
|      1 | L001      |    98 |  0.75 |  0.5 |
|      4 | L001      |    99 | 0.875 | 0.75 |
|      3 | L001      |   100 |     1 |    1 |
|      1 | L002      |    86 |  0.25 | 0.25 |
|      4 | L002      |    88 | 0.375 |  0.5 |
|      2 | L002      |    90 |   0.5 | 0.75 |
|      3 | L002      |    91 | 0.625 |    1 |
+--------+-----------+-------+-------+------+

其中,cd1将整张表作为窗口,按成绩排序(默认升序),排序后cd1列的值依次为0-1;然后cd2按照lesson_id分组score排序,打乱了cd1的排列顺序,得到上图结果。

 

2.3 前后函数:LAG(expr,n)LEAD(expr,n)

  • 用途:返回位于当前行的前n行(LAG(expr,n))或后n行(LEAD(expr,n))的expr的值
  • 应用场景:查询前1名同学的成绩和当前同学成绩的差值

内层SQL先通过LAG()函数得到前1名同学的成绩,外层SQL再将当前同学和前1名同学的成绩做差得到成绩差值diff

SELECT stu_id, lesson_id, score, pre_score,score-pre_score AS diff
    FROM(
     SELECT stu_id, lesson_id, score,
     LAG(score,1) OVER w AS pre_score
     FROM t_score
     WHERE lesson_id IN ('L001','L002')
     WINDOW w AS (PARTITION BY lesson_id ORDER BY score)) t;

+--------+-----------+-------+-----------+------+
| stu_id | lesson_id | score | pre_score | diff |
+--------+-----------+-------+-----------+------+
|      2 | L001      |    84 |      NULL | NULL |
|      1 | L001      |    98 |        84 |   14 |
|      4 | L001      |    99 |        98 |    1 |
|      3 | L001      |   100 |        99 |    1 |
|      1 | L002      |    86 |      NULL | NULL |
|      4 | L002      |    88 |        86 |    2 |
|      2 | L002      |    90 |        88 |    2 |
|      3 | L002      |    91 |        90 |    1 |
+--------+-----------+-------+-----------+------+

 

2.4 头尾函数:FIRST_VALUE(expr)LAST_VALUE(expr)

  • 用途:返回第一个(FIRST_VALUE(expr))或最后一个(LAST_VALUE(expr))expr的值
  • 应用场景:截止到当前成绩,按照日期排序查询第1个和最后1个同学的分数
- 添加新列:
ALTER TABLE t_score ADD create_time DATE;
SELECT stu_id, lesson_id, score, create_time,
    FIRST_VALUE(score) OVER w AS first_score,
    LAST_VALUE(score) OVER w AS last_score
    FROM t_score
    WHERE lesson_id IN ('L001','L002')
    WINDOW w AS (PARTITION BY lesson_id ORDER BY create_time);

+--------+-----------+-------+-------------+-------------+------------+
| stu_id | lesson_id | score | create_time | first_score | last_score |
+--------+-----------+-------+-------------+-------------+------------+
|      3 | L001      |   100 | 2018-08-07  |         100 |        100 |
|      1 | L001      |    98 | 2018-08-08  |         100 |         98 |
|      2 | L001      |    84 | 2018-08-09  |         100 |         84 |
|      4 | L001      |    99 | 2018-08-09  |         100 |         99 |
|      3 | L002      |    91 | 2018-08-07  |          91 |         91 |
|      1 | L002      |    86 | 2018-08-08  |          91 |         86 |
|      2 | L002      |    90 | 2018-08-09  |          91 |         90 |
|      4 | L002      |    88 | 2018-08-10  |          91 |         88 |
+--------+-----------+-------+-------------+-------------+------------+

其中,FIRST_VALUE取得的是当前窗口的第一行,LAST_VALUE取得的是当前窗口到此条记录的最后一行,即当前记录;若想取得当前窗口的最后一行记录,需添加frame子句,rows between unbounded preceding and unbounded following。

 

2.5 其它函数:NTH_VALUE(expr, n)NTILE(n)

① NTH_VALUE(expr,n)

  • 用途:返回窗口中第n个expr的值。expr可以是表达式,也可以是列名
  • 应用场景:截止到当前成绩,显示每个同学的成绩中排名第2和第3的成绩的分数
SELECT stu_id, lesson_id, score,
    NTH_VALUE(score,2) OVER w AS second_score,
    NTH_VALUE(score,3) OVER w AS third_score
    FROM t_score
    WHERE stu_id IN (1,2)
    WINDOW w AS (PARTITION BY stu_id ORDER BY score);

+--------+-----------+-------+--------------+-------------+
| stu_id | lesson_id | score | second_score | third_score |
+--------+-----------+-------+--------------+-------------+
|      1 | L003      |    79 |         NULL |        NULL |
|      1 | L002      |    86 |           86 |        NULL |
|      1 | L004      |    88 |           86 |          88 |
|      1 | L001      |    98 |           86 |          88 |
|      1 | L005      |    98 |           86 |          88 |
|      2 | L004      |    75 |         NULL |        NULL |
|      2 | L005      |    77 |           77 |        NULL |
|      2 | L001      |    84 |           77 |          84 |
|      2 | L003      |    86 |           77 |          84 |
|      2 | L002      |    90 |           77 |          84 |
+--------+-----------+-------+--------------+-------------+

② NTILE(n):分桶

  • 用途:将分区中的有序数据分为n个等级,记录等级数
  • 应用场景:将每门课程按照成绩分成3组
SELECT
    NTILE(3) OVER w AS nf,
    stu_id, lesson_id, score
    FROM t_score
    WHERE lesson_id IN ('L001','L002')
    WINDOW w AS (PARTITION BY lesson_id ORDER BY score);

+------+--------+-----------+-------+
| nf   | stu_id | lesson_id | score |
+------+--------+-----------+-------+
|    1 |      2 | L001      |    84 |
|    1 |      1 | L001      |    98 |
|    2 |      4 | L001      |    99 |
|    3 |      3 | L001      |   100 |
|    1 |      1 | L002      |    86 |
|    1 |      4 | L002      |    88 |
|    2 |      2 | L002      |    90 |
|    3 |      3 | L002      |    91 |
+------+--------+-----------+-------+

NTILE(n)函数在数据分析中应用较多,比如由于数据量大,需要将数据平均分配到n个并行的进程分别计算,此时就可以用NTILE(n)对数据进行分组(由于记录数不一定被n整除,所以数据不一定完全平均),然后将不同桶号的数据再分配。

 

2.6 聚合函数作为窗口函数:SUM()AVG()MAX()MIN()COUNT()

  • 用途:在窗口中每条记录动态地应用聚合函数(SUM()AVG()MAX()MIN()COUNT()),可以动态计算在指定的窗口内的各种聚合函数值
  • 应用场景:截止到当前时间,学生stu_id=1的累计分数/平均分数/最高分数/最低分数/成绩记录条数
SELECT stu_id, lesson_id, score, create_time,
    SUM(score) OVER w AS score_sum,
    MAX(score) OVER w AS score_max,
    MIN(score) OVER w AS score_min
    FROM t_score
    WHERE stu_id = 1
    WINDOW w AS (PARTITION BY stu_id ORDER BY create_time);

+--------+-----------+-------+-------------+-----------+-----------+-----------+
| stu_id | lesson_id | score | create_time | score_sum | score_max | score_min |
+--------+-----------+-------+-------------+-----------+-----------+-----------+
|      1 | L001      |    98 | 2018-08-08  |       184 |        98 |        86 |
|      1 | L002      |    86 | 2018-08-08  |       184 |        98 |        86 |
|      1 | L003      |    79 | 2018-08-09  |       263 |        98 |        79 |
|      1 | L004      |    88 | 2018-08-10  |       449 |        98 |        79 |
|      1 | L005      |    98 | 2018-08-10  |       449 |        98 |        79 |
+--------+-----------+-------+-------------+-----------+-----------+-----------+

注意:sum作为窗口函数,当求和字段与排序字段相同时,如果有重复数据,会直接把相同的数据相加,并不是逐步相加。为什么呢?

 

2.7 Frame子句

frame是当前分区的一个子集,子句用来定义子集的规则,通常用来作为滑动窗口使用。比如要根据每个订单动态计算包括本订单和按时间顺序前后两个订单的平均订单金额,则可以设置如下frame子句来创建滑动窗口:

select * from(
    select order_id,user_no,amount,
    avg(amount) over w as avg_num,
    create_date
    from order_tabl
    window w as (partition by user_no order by create_date desc rows between 1 preceding and 1 following)
)t;

+----------+---------+--------+-----------+---------------------+
| order_id | user_no | amount |   avg_num |         create_date |
+----------+-----------+-------+-------- -+---------------------+
|        5 | 001     |    900 | 850.0000  | 2018-01-04 00:00:00 |
|        4 | 001     |    800 | 666.6667  | 2018-01-03 00:00:00 | 
|        2 | 001     |    300 | 533.3333  | 2018-01-02 00:00:00 |
|        3 | 001     |    500 | 300.0000  | 2018-01-02 00:00:00 |
|        1 | 001     |    100 | 300.0000  | 2018-01-01 00:00:00 |
|       10 | 002     |    800 | 800.0000  | 2018-01-22 00:00:00 |
|        9 | 002     |    800 | 633.3333  | 2018-01-16 00:00:00 |
|        8 | 002     |    300 | 566.6667  | 2018-01-10 00:00:00 |
|        7 | 002     |    600 | 466.6667  | 2018-01-04 00:00:00 |
|        6 | 002     |    500 | 550.0000  | 2018-01-03 00:00:00 |
+----------+---------+--------+-----------+---------------------+

从结果可以看出,order_id为5订单属于边界值,没有前一行,因此平均订单金额为(900+800)/2=850;order_id为4的订单前后都有订单,所以平均订单金额为(900+800+300)/3=666.6667,以此类推就可以得到一个基于滑动窗口的动态平均订单值。

对于滑动窗口的范围指定,有两种方式,基于行和基于范围,具体区别如下:

① 基于行:通常使用BETWEEN frame_start AND frame_end语法来表示行范围,frame_start和frame_end可以支持如下关键字,来确定不同的动态行记录:

  • CURRENT ROW 边界是当前行,一般和其他范围关键字一起使用
  • UNBOUNDED PRECEDING 边界是分区中的第一行
  • UNBOUNDED FOLLOWING 边界是分区中的最后一行
  • expr PRECEDING  边界是当前行减去expr的值
  • expr FOLLOWING  边界是当前行加上expr的值
  • rows BETWEEN 1 PRECEDING AND 1 FOLLOWING 窗口范围是当前行、前一行、后一行一共三行记录。
  • rows UNBOUNDED FOLLOWING 窗口范围是当前行到分区中的最后一行。
  • rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 窗口范围是当前分区中所有行,等同于不写。(last_value的窗口范围是窗口首行-当前行!即rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

② 基于范围:有些范围不可直接用行数来表示,比如窗口范围是一周前订单开始,截止到当前行,无法使用rows来表示,此时就可以使用范围来表示窗口:INTERVAL 7 DAY PRECEDING。Linux中常见的最近1分钟、5分钟负载是一个典型的应用场景。有的函数不管有没有frame子句,它的窗口都是固定的,也就是前面介绍的静态窗口,这些函数包括如下(排序/前后/分布等函数):

  • DENSE_RANK()、RANK()、ROW_NUMBER()
  • LAG()、LEAD()
  • PERCENT_RANK()、CUME_DIST()
  • NTILE()

接下来我们以会员订单表dm_order_2018为例,来介绍每个函数的使用方法。表里包含了如下字段:订单id,订单金额,会员id,订单创建时间,订单状态等。

 

3. 窗口函数实战

3.1 基础数据准备

(1)建表

CREATE TABLE dm_order_2018
(
    id          varchar(100) DEFAULT NULL COMMENT '订单id',
    order_money float        DEFAULT NULL COMMENT '订单金额',
    member_id   int(11)      DEFAULT NULL COMMENT '会员id',
    create_time timestamp    DEFAULT NULL COMMENT '创建时间',
    status      int(11)      DEFAULT NULL COMMENT '订单状态|0:取消,1:待支付,2:付款成功'
);

(2)插入数据

INSERT INTO dm_order_2018 VALUES ('A001', '100.15', '1', '2018-01-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A023', '100.15', '1', '2018-01-01 15:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A002', '100', '2', '2018-04-01 14:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A003', '12.1', '3', '2018-01-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A004', '200.15', '4', '2018-01-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A005', '1200.15', '5', '2018-01-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A006', '0.15', '11', '2018-01-01 17:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A007', '1215.1', '31', '2018-01-01 10:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A008', '100.75', '15', '2018-01-01 14:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A009', '100.15', '8', '2018-01-01 15:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A010', '20.15', '9', '2018-01-01 16:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A011', '110.15', '21', '2018-01-01 14:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A012', '1.15', '13', '2018-01-01 13:03:12', '2');
INSERT INTO dm_order_2018 VALUES ('A013', '20.15', '14', '2018-02-01 13:53:12', '2');
INSERT INTO dm_order_2018 VALUES ('A014', '30.15', '15', '2018-03-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A015', '40.5', '13', '2018-01-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A016', '65', '1', '2017-01-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A017', '78.15', '1', '2017-03-01 09:23:12', '1');
INSERT INTO dm_order_2018 VALUES ('A018', '88.15', '1', '2017-11-01 20:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A019', '99', '12', '2018-02-11 13:23:12', '1');
INSERT INTO dm_order_2018 VALUES ('A020', '10.3', '13', '2018-04-01 13:23:12', '2');
INSERT INTO dm_order_2018 VALUES ('A021', '600', '5', '2018-11-01 08:58:31', '2');
INSERT INTO dm_order_2018 VALUES ('A022', '500', '31', '2018-11-11 08:59:02', '2');

(3)查看数据

select * FROM dm_order_2018;

+----+-----------+---------+-------------------+------+
|id  |order_money|member_id|create_time        |status|
+----+-----------+---------+-------------------+------+
|A022|500        |31       |2018-11-11 08:59:02|2     |
|A001|100.15     |1        |2018-01-01 13:23:12|2     |
|A023|100.15     |1        |2018-01-01 15:23:12|2     |
|A002|100        |2        |2018-04-01 14:23:12|2     |
|A003|12.1       |3        |2018-01-01 13:23:12|2     |
|A004|200.15     |4        |2018-01-01 13:23:12|2     |
|A005|1200.15    |5        |2018-01-01 13:23:12|2     |
|A006|0.15       |11       |2018-01-01 17:23:12|2     |
|A007|1215.1     |31       |2018-01-01 10:23:12|2     |
|A008|100.75     |15       |2018-01-01 14:23:12|2     |
|A009|100.15     |8        |2018-01-01 15:23:12|2     |
|A010|20.15      |9        |2018-01-01 16:23:12|2     |
|A011|110.15     |21       |2018-01-01 14:23:12|2     |
|A012|1.15       |13       |2018-01-01 13:03:12|2     |
|A013|20.15      |14       |2018-02-01 13:53:12|2     |
|A014|30.15      |15       |2018-03-01 13:23:12|2     |
|A015|40.5       |13       |2018-01-01 13:23:12|2     |
|A016|65         |1        |2017-01-01 13:23:12|2     |
|A017|78.15      |1        |2017-03-01 09:23:12|1     |
|A018|88.15      |1        |2017-11-01 20:23:12|2     |
|A019|99         |12       |2018-02-11 13:23:12|1     |
|A020|10.3       |13       |2018-04-01 13:23:12|2     |
|A021|600        |5        |2018-11-01 08:58:31|2     |
+----+-----------+---------+-------------------+------+

 

3.2 窗口函数用法实战

(1)序号函数:row_number() / rank() / dense_rank()

查询排序小于等于3的订单:

select *
FROM (
         select id,
                order_money,
                member_id,
                create_time,
                row_number() over (partition by member_id ORDER BY order_money desc) as row_num
         from dm_order_2018
     ) t
where row_num <= 3;

+----+-----------+---------+-------------------+-------+
|id  |order_money|member_id|create_time        |row_num|
+----+-----------+---------+-------------------+-------+
|A001|100.15     |1        |2018-01-01 13:23:12|1      |
|A023|100.15     |1        |2018-01-01 15:23:12|2      |
|A018|88.15      |1        |2017-11-01 20:23:12|3      |
|A002|100        |2        |2018-04-01 14:23:12|1      |
|A003|12.1       |3        |2018-01-01 13:23:12|1      |
|A004|200.15     |4        |2018-01-01 13:23:12|1      |
|A005|1200.15    |5        |2018-01-01 13:23:12|1      |
|A021|600        |5        |2018-11-01 08:58:31|2      |
|A009|100.15     |8        |2018-01-01 15:23:12|1      |
|A010|20.15      |9        |2018-01-01 16:23:12|1      |
|A006|0.15       |11       |2018-01-01 17:23:12|1      |
|A019|99         |12       |2018-02-11 13:23:12|1      |
|A015|40.5       |13       |2018-01-01 13:23:12|1      |
|A020|10.3       |13       |2018-04-01 13:23:12|2      |
|A012|1.15       |13       |2018-01-01 13:03:12|3      |
|A013|20.15      |14       |2018-02-01 13:53:12|1      |
|A008|100.75     |15       |2018-01-01 14:23:12|1      |
|A014|30.15      |15       |2018-03-01 13:23:12|2      |
|A011|110.15     |21       |2018-01-01 14:23:12|1      |
|A007|1215.1     |31       |2018-01-01 10:23:12|1      |
|A022|500        |31       |2018-11-11 08:59:02|2      |
+----+-----------+---------+-------------------+-------+

多种排序方式:

select *
FROM (select id,
             order_money,
             member_id,
             create_time,
             row_number() over (partition by member_id ORDER BY order_money desc) as row_num1,
             rank() over (partition by member_id ORDER BY order_money desc)       as row_num2,
             dense_rank() over (partition by member_id ORDER BY order_money desc) as row_num3
      from dm_order_2018
     ) t;

+----+-----------+---------+-------------------+--------+--------+--------+
|id  |order_money|member_id|create_time        |row_num1|row_num2|row_num3|
+----+-----------+---------+-------------------+--------+--------+--------+
|A001|100.15     |1        |2018-01-01 13:23:12|1       |1       |1       |
|A023|100.15     |1        |2018-01-01 15:23:12|2       |1       |1       |
|A018|88.15      |1        |2017-11-01 20:23:12|3       |3       |2       |
|A017|78.15      |1        |2017-03-01 09:23:12|4       |4       |3       |
|A016|65         |1        |2017-01-01 13:23:12|5       |5       |4       |
|A002|100        |2        |2018-04-01 14:23:12|1       |1       |1       |
|A003|12.1       |3        |2018-01-01 13:23:12|1       |1       |1       |
|A004|200.15     |4        |2018-01-01 13:23:12|1       |1       |1       |
|A005|1200.15    |5        |2018-01-01 13:23:12|1       |1       |1       |
|A021|600        |5        |2018-11-01 08:58:31|2       |2       |2       |
|A009|100.15     |8        |2018-01-01 15:23:12|1       |1       |1       |
|A010|20.15      |9        |2018-01-01 16:23:12|1       |1       |1       |
|A006|0.15       |11       |2018-01-01 17:23:12|1       |1       |1       |
|A019|99         |12       |2018-02-11 13:23:12|1       |1       |1       |
|A015|40.5       |13       |2018-01-01 13:23:12|1       |1       |1       |
|A020|10.3       |13       |2018-04-01 13:23:12|2       |2       |2       |
|A012|1.15       |13       |2018-01-01 13:03:12|3       |3       |3       |
|A013|20.15      |14       |2018-02-01 13:53:12|1       |1       |1       |
|A008|100.75     |15       |2018-01-01 14:23:12|1       |1       |1       |
|A014|30.15      |15       |2018-03-01 13:23:12|2       |2       |2       |
|A011|110.15     |21       |2018-01-01 14:23:12|1       |1       |1       |
|A007|1215.1     |31       |2018-01-01 10:23:12|1       |1       |1       |
|A022|500        |31       |2018-11-11 08:59:02|2       |2       |2       |
+----+-----------+---------+-------------------+--------+--------+--------+

rank()会产生序号相同的记录,同时可能产生序号间隙,如1,1,3;而dense_rank()也会产生序号相同的记录,但不会产生序号间隙,如1,1,2。

 

(2)分布函数:percent_rank() / cume_dist()

查看数据的分位数:

select *
FROM (select id,
             order_money,
             member_id,
             create_time,
             row_number() over w   as row_num,
             percent_rank() over w as percent
      from dm_order_2018
          window w as (partition by member_id ORDER BY create_time DESC)
     ) t;

+----+-----------+---------+-------------------+-------+-------+
|id  |order_money|member_id|create_time        |row_num|percent|
+----+-----------+---------+-------------------+-------+-------+
|A023|100.15     |1        |2018-01-01 15:23:12|1      |0      |
|A001|100.15     |1        |2018-01-01 13:23:12|2      |0.25   |
|A018|88.15      |1        |2017-11-01 20:23:12|3      |0.5    |
|A017|78.15      |1        |2017-03-01 09:23:12|4      |0.75   |
|A016|65         |1        |2017-01-01 13:23:12|5      |1      |
|A002|100        |2        |2018-04-01 14:23:12|1      |0      |
|A003|12.1       |3        |2018-01-01 13:23:12|1      |0      |
|A004|200.15     |4        |2018-01-01 13:23:12|1      |0      |
|A021|600        |5        |2018-11-01 08:58:31|1      |0      |
|A005|1200.15    |5        |2018-01-01 13:23:12|2      |1      |
|A009|100.15     |8        |2018-01-01 15:23:12|1      |0      |
|A010|20.15      |9        |2018-01-01 16:23:12|1      |0      |
|A006|0.15       |11       |2018-01-01 17:23:12|1      |0      |
|A019|99         |12       |2018-02-11 13:23:12|1      |0      |
|A020|10.3       |13       |2018-04-01 13:23:12|1      |0      |
|A015|40.5       |13       |2018-01-01 13:23:12|2      |0.5    |
|A012|1.15       |13       |2018-01-01 13:03:12|3      |1      |
|A013|20.15      |14       |2018-02-01 13:53:12|1      |0      |
|A014|30.15      |15       |2018-03-01 13:23:12|1      |0      |
|A008|100.75     |15       |2018-01-01 14:23:12|2      |1      |
|A011|110.15     |21       |2018-01-01 14:23:12|1      |0      |
|A022|500        |31       |2018-11-11 08:59:02|1      |0      |
|A007|1215.1     |31       |2018-01-01 10:23:12|2      |1      |
+----+-----------+---------+-------------------+-------+-------+

大于等于当前订单金额的订单比例有多少:

select *
FROM (select id,
             order_money,
             member_id,
             create_time,
             row_number() over w   as row_num,
             percent_rank() over w as percent
      from dm_order_2018
          window w as (partition by member_id ORDER BY create_time DESC)
     ) t;

+----+-----------+---------+-------------------+-------+-------+
|id  |order_money|member_id|create_time        |row_num|percent|
+----+-----------+---------+-------------------+-------+-------+
|A023|100.15     |1        |2018-01-01 15:23:12|1      |0      |
|A001|100.15     |1        |2018-01-01 13:23:12|2      |0.25   |
|A018|88.15      |1        |2017-11-01 20:23:12|3      |0.5    |
|A017|78.15      |1        |2017-03-01 09:23:12|4      |0.75   |
|A016|65         |1        |2017-01-01 13:23:12|5      |1      |
|A002|100        |2        |2018-04-01 14:23:12|1      |0      |
|A003|12.1       |3        |2018-01-01 13:23:12|1      |0      |
|A004|200.15     |4        |2018-01-01 13:23:12|1      |0      |
|A021|600        |5        |2018-11-01 08:58:31|1      |0      |
|A005|1200.15    |5        |2018-01-01 13:23:12|2      |1      |
|A009|100.15     |8        |2018-01-01 15:23:12|1      |0      |
|A010|20.15      |9        |2018-01-01 16:23:12|1      |0      |
|A006|0.15       |11       |2018-01-01 17:23:12|1      |0      |
|A019|99         |12       |2018-02-11 13:23:12|1      |0      |
|A020|10.3       |13       |2018-04-01 13:23:12|1      |0      |
|A015|40.5       |13       |2018-01-01 13:23:12|2      |0.5    |
|A012|1.15       |13       |2018-01-01 13:03:12|3      |1      |
|A013|20.15      |14       |2018-02-01 13:53:12|1      |0      |
|A014|30.15      |15       |2018-03-01 13:23:12|1      |0      |
|A008|100.75     |15       |2018-01-01 14:23:12|2      |1      |
|A011|110.15     |21       |2018-01-01 14:23:12|1      |0      |
|A022|500        |31       |2018-11-11 08:59:02|1      |0      |
|A007|1215.1     |31       |2018-01-01 10:23:12|2      |1      |
+----+-----------+---------+-------------------+-------+-------+

 

(3)前后函数:lag() / lead()

-- 查询上一个订单距离当前订单的时间间隔。
select id, order_money, member_id, create_time, last_time, datediff(create_time, last_time) as diff
FROM (select id,
             order_money,
             member_id,
             create_time,
             lag(create_time, 1) over w as last_time
      from dm_order_2018
          window w as (partition by member_id ORDER BY create_time ASC)
     ) t;

+----+-----------+---------+-------------------+-------------------+----+
|id  |order_money|member_id|create_time        |last_time          |diff|
+----+-----------+---------+-------------------+-------------------+----+
|A016|65         |1        |2017-01-01 13:23:12|NULL               |NULL|
|A017|78.15      |1        |2017-03-01 09:23:12|2017-01-01 13:23:12|59  |
|A018|88.15      |1        |2017-11-01 20:23:12|2017-03-01 09:23:12|245 |
|A001|100.15     |1        |2018-01-01 13:23:12|2017-11-01 20:23:12|61  |
|A023|100.15     |1        |2018-01-01 15:23:12|2018-01-01 13:23:12|0   |
|A002|100        |2        |2018-04-01 14:23:12|NULL               |NULL|
|A003|12.1       |3        |2018-01-01 13:23:12|NULL               |NULL|
|A004|200.15     |4        |2018-01-01 13:23:12|NULL               |NULL|
|A005|1200.15    |5        |2018-01-01 13:23:12|NULL               |NULL|
|A021|600        |5        |2018-11-01 08:58:31|2018-01-01 13:23:12|304 |
|A009|100.15     |8        |2018-01-01 15:23:12|NULL               |NULL|
|A010|20.15      |9        |2018-01-01 16:23:12|NULL               |NULL|
|A006|0.15       |11       |2018-01-01 17:23:12|NULL               |NULL|
|A019|99         |12       |2018-02-11 13:23:12|NULL               |NULL|
|A012|1.15       |13       |2018-01-01 13:03:12|NULL               |NULL|
|A015|40.5       |13       |2018-01-01 13:23:12|2018-01-01 13:03:12|0   |
|A020|10.3       |13       |2018-04-01 13:23:12|2018-01-01 13:23:12|90  |
|A013|20.15      |14       |2018-02-01 13:53:12|NULL               |NULL|
|A008|100.75     |15       |2018-01-01 14:23:12|NULL               |NULL|
|A014|30.15      |15       |2018-03-01 13:23:12|2018-01-01 14:23:12|59  |
|A011|110.15     |21       |2018-01-01 14:23:12|NULL               |NULL|
|A007|1215.1     |31       |2018-01-01 10:23:12|NULL               |NULL|
|A022|500        |31       |2018-11-11 08:59:02|2018-01-01 10:23:12|314 |
+----+-----------+---------+-------------------+-------------------+----+

可以看到,没有匹配到的数值默认以null填充。

 

(4)头尾函数:first_val() / last_val()

-- 查询截止到当前订单,按照日期排序第一个订单和最后一个订单的订单金额。
select *
FROM (select id,
             order_money,
             member_id,
             create_time,
             first_value(order_money) over w as first_money,
             last_value(order_money) over w  as last_money
      from dm_order_2018
          window w as (partition by member_id ORDER BY create_time ASC)
     ) t;

+----+-----------+---------+-------------------+-------------------+-------------------+
|id  |order_money|member_id|create_time        |first_money        |last_money         |
+----+-----------+---------+-------------------+-------------------+-------------------+
|A016|65         |1        |2017-01-01 13:23:12|65                 |65                 |
|A017|78.15      |1        |2017-03-01 09:23:12|65                 |78.1500015258789   |
|A018|88.15      |1        |2017-11-01 20:23:12|65                 |88.1500015258789   |
|A001|100.15     |1        |2018-01-01 13:23:12|65                 |100.1500015258789  |
|A023|100.15     |1        |2018-01-01 15:23:12|65                 |100.1500015258789  |
|A002|100        |2        |2018-04-01 14:23:12|100                |100                |
|A003|12.1       |3        |2018-01-01 13:23:12|12.100000381469727 |12.100000381469727 |
|A004|200.15     |4        |2018-01-01 13:23:12|200.14999389648438 |200.14999389648438 |
|A005|1200.15    |5        |2018-01-01 13:23:12|1200.1500244140625 |1200.1500244140625 |
|A021|600        |5        |2018-11-01 08:58:31|1200.1500244140625 |600                |
|A009|100.15     |8        |2018-01-01 15:23:12|100.1500015258789  |100.1500015258789  |
|A010|20.15      |9        |2018-01-01 16:23:12|20.149999618530273 |20.149999618530273 |
|A006|0.15       |11       |2018-01-01 17:23:12|0.15000000596046448|0.15000000596046448|
|A019|99         |12       |2018-02-11 13:23:12|99                 |99                 |
|A012|1.15       |13       |2018-01-01 13:03:12|1.149999976158142  |1.149999976158142  |
|A015|40.5       |13       |2018-01-01 13:23:12|1.149999976158142  |40.5               |
|A020|10.3       |13       |2018-04-01 13:23:12|1.149999976158142  |10.300000190734863 |
|A013|20.15      |14       |2018-02-01 13:53:12|20.149999618530273 |20.149999618530273 |
|A008|100.75     |15       |2018-01-01 14:23:12|100.75             |100.75             |
|A014|30.15      |15       |2018-03-01 13:23:12|100.75             |30.149999618530273 |
|A011|110.15     |21       |2018-01-01 14:23:12|110.1500015258789  |110.1500015258789  |
|A007|1215.1     |31       |2018-01-01 10:23:12|1215.0999755859375 |1215.0999755859375 |
|A022|500        |31       |2018-11-11 08:59:02|1215.0999755859375 |500                |
+----+-----------+---------+-------------------+-------------------+-------------------+

查询截止到当前订单,按照日期排序第一个订单和最后一个订单的订单金额。

 

(5)滑动窗口示例

select id,
       order_money,
       member_id,
       create_time,
       avg(order_money) over w as avg_num,
       sum(order_money) over w as sum_num
from dm_order_2018
    window w as (partition by member_id order by create_time desc rows between 1 preceding and 1 following)

+----+-----------+---------+-------------------+-------------------+-------------------+
|id  |order_money|member_id|create_time        |avg_num            |sum_num            |
+----+-----------+---------+-------------------+-------------------+-------------------+
|A023|100.15     |1        |2018-01-01 15:23:12|100.1500015258789  |200.3000030517578  |
|A001|100.15     |1        |2018-01-01 13:23:12|96.1500015258789   |288.4500045776367  |
|A018|88.15      |1        |2017-11-01 20:23:12|88.81666819254558  |266.4500045776367  |
|A017|78.15      |1        |2017-03-01 09:23:12|77.10000101725261  |231.3000030517578  |
|A016|65         |1        |2017-01-01 13:23:12|71.57500076293945  |143.1500015258789  |
|A002|100        |2        |2018-04-01 14:23:12|100                |100                |
|A003|12.1       |3        |2018-01-01 13:23:12|12.100000381469727 |12.100000381469727 |
|A004|200.15     |4        |2018-01-01 13:23:12|200.14999389648438 |200.14999389648438 |
|A021|600        |5        |2018-11-01 08:58:31|900.0750122070312  |1800.1500244140625 |
|A005|1200.15    |5        |2018-01-01 13:23:12|900.0750122070312  |1800.1500244140625 |
|A009|100.15     |8        |2018-01-01 15:23:12|100.1500015258789  |100.1500015258789  |
|A010|20.15      |9        |2018-01-01 16:23:12|20.149999618530273 |20.149999618530273 |
|A006|0.15       |11       |2018-01-01 17:23:12|0.15000000596046448|0.15000000596046448|
|A019|99         |12       |2018-02-11 13:23:12|99                 |99                 |
|A020|10.3       |13       |2018-04-01 13:23:12|25.40000009536743  |50.80000019073486  |
|A015|40.5       |13       |2018-01-01 13:23:12|17.31666672229767  |51.950000166893005 |
|A012|1.15       |13       |2018-01-01 13:03:12|20.82499998807907  |41.64999997615814  |
|A013|20.15      |14       |2018-02-01 13:53:12|20.149999618530273 |20.149999618530273 |
|A014|30.15      |15       |2018-03-01 13:23:12|65.44999980926514  |130.89999961853027 |
|A008|100.75     |15       |2018-01-01 14:23:12|65.44999980926514  |130.89999961853027 |
|A011|110.15     |21       |2018-01-01 14:23:12|110.1500015258789  |110.1500015258789  |
|A022|500        |31       |2018-11-11 08:59:02|857.5499877929688  |1715.0999755859375 |
|A007|1215.1     |31       |2018-01-01 10:23:12|857.5499877929688  |1715.0999755859375 |
+----+-----------+---------+-------------------+-------------------+-------------------+

从结果可以看出,id为A001订单属于边界值,没有前一行,因此平均订单金额为(100.15+100.15)/2=100.15;第二行id为A001的订单前后都有订单,所以平均订单金额为(100.15+100.15+88.15/3=96.15;以此类推就可以得到一个基于滑动窗口的动态平均订单值。此例中,窗口函数用到了传统的聚合函数avg(),用来计算动态的平均值。

 

4. 相关业务问题

  • 每个类别下用户最喜欢的产品是哪个?
  • 每个类别下用户点击最多的5个商品是什么?

4.1 解决方法

  • 分组取最大值(用关联子查询):查询每个学号成绩最大的所有信息
SELECT* FROM score AS a WHERE 成绩=( SELECT MAX(成绩) FROM score AS b WHERE a.`学号`=b.`学号`);
  • 分组取最小值:查询每个学号成绩最小的所有信息
SELECT* FROM score AS a WHERE 成绩=( SELECT MIN(成绩) FROM score AS b WHERE a.`学号`=b.`学号`);
  • 分组取最大N条记录:查询每个学生成绩最高的两个科目
SELECT * FROM (SELECT*,row_number() over (PARTITION BY 姓名 ORDER BY 成绩 DESC) AS ranking FROM test1) AS newtest WHERE ranking<=2;

说明:

  • 为了不受并列的影响,该题用row_number()
  • 易错写法:select*, row_number() over(partition by 姓名 order by 成绩 desc) as ranking from test where ranking<=2; 按照sql运行顺序,where后边不能加别名,因为select在where后运行
  • 涉及到既要分组又要排序的情况,要想到用窗口函数

 

4.2 TOPN问题模板

SELECT * FROM (SELECT*,row_number() over (PARTITION BY 姓名 ORDER BY 成绩 DESC) AS ranking FROM test1) AS newtest WHERE ranking<=N;

 

4.3 聚合函数作为窗口函数

聚合函数作为窗口函数,是起到"累加/累计"的效果,比如截止到本行,最大值,最小值是多少。与专用窗口函数的区别:括号中需要有指定列,不能为空

举例1:查询成绩的累加

SELECT*, SUM(成绩) over (ORDER BY 成绩 DESC) AS '求和' FROM class;

举例2:查找单科成绩高于该科目平均成绩的学生名单

(1)解法一:单科成绩需按科目分组,分组有两种:group by子句和窗口函数的partition by;要求平均值,通过avg()实现,并且找到大于平均值的学生,那么不能减少行数,则用partition by;

#错误解法:不能用where 成绩>分组平均值,因为where子句在select子句之后执行
select * ,avg(成绩) over (partition by 科目) as 分组平均值 from test3 where 成绩>分组平均值

#正确解法: 
select * from (select * ,avg(成绩) over (partition by 科目) as 分组平均值 from test3)as a where 成绩>分组平均值;

(2)解法二:关联子查询;

SELECT* FROM test3 AS a WHERE 成绩>(SELECT AVG(成绩) AS 平均值 FROM test3 AS b WHERE a.`科目`=b.`科目`);

 

5. 总结

1. 语法:窗口函数 over (partition by 用于分列的列名 order by 用于排序的列名)

2. 功能:既能分组又可以排序,且不改变行数

3. 分类:序号函数、分布函数、前后函数、头尾函数、聚合函数、其他函数等

4. 注意事项:一般写在select子句中

 

 

今天分享下MySQL中的窗口函数的实际操作,这对分析查询很有帮助,希望对大家有帮助。谢谢支持

欢迎关注微信公众号,访问更多精彩:数据星球DataStar

如需转载,请联系授权,谢谢合作。

 

引文链接:

[1]MySQL操作实战(二):窗口函数

https://blog.csdn.net/weixin_39010770/article/details/87862407

[2]数据分析系列 18/32 | MySQL中的数据分析函数实战

https://zhuanlan.zhihu.com/p/137711813

[3]MySQL 8.0窗口函数

https://www.cnblogs.com/DataArt/p/9961676.html

 

  • 5
    点赞
  • 18
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值