1. 窗口函数和普通聚合函数的区别
- 聚合函数是将多条记录合并为一条;窗口函数是每条记录都会执行,原来有几条记录最终执行完还是几条
- 聚合函数也可以用于窗口函数:窗口函数在逻辑上的执行顺序是在
FROM
、JOIN
、WHERE
、GROUP BY
、HAVING
之后,在ORDER BY
、LIMIT
、SELECT DISTINCT
之前。它执行之前GROUP BY
的聚合过程已经完成了,所以不会再产生数据聚合。
注:窗口函数是在 WHERE
之后执行的,所以如果 WHERE
子句需要用窗口函数作为条件,需要多套一层子查询,例如:
select b.user_name,b.goods_category
from
(select
user_name,
goods_category,
row_number() over(partition by user_name
order by count(order_pay) desc ) as rank
from user_goods_table
group by user_name,goods_category) b
where b.rank =1;
2. 常用窗口函数
2.1 Row_Number,Rank,Dense_Rank
row_number()
:从1开始,按照顺序,生成分组内记录的序列,row_number()
的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列;通常用于获取分组内排序第一的记录;获取一个session中的第一条refer等。
rank()
:生成数据项在分组中的排名(并列排序,跳过重复序号:1、1、3…)
dense_rank()
:生成数据项在分组中的排名(并列排序,不跳过重复序:1、1、2…)
2.2 SUM、AVG、MIN、MAX
SUM、AVG、MIN、MAX的使用方法类似,以下以 SUM 为例:
SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,
-- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
--从起点到当前行,结果同pv1
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3,
--当前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime
ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4,
--当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5
--当前行+往后所有行
FROM test1;
结果数据示例:
cookieid createtime pv pv1 pv2 pv3 pv4 pv5
cookie1 2019-12-01 3 3 3 3 3 3
cookie2 2019-12-00 3 3 3 3 3 3
cookie3 2019-12-10 1 1 1 1 6 26
cookie3 2019-12-11 5 6 6 6 13 25
cookie3 2019-12-12 7 13 13 13 16 20
cookie3 2019-12-13 3 16 16 16 18 13
cookie3 2019-12-14 2 18 18 17 21 10
cookie3 2019-12-15 4 22 22 16 20 8
cookie3 2019-12-16 4 26 26 13 13 4
cookie4 2019-12-12 7 7 7 7 13 14
cookie4 2019-12-16 6 13 13 13 14 7
cookie4 2019-12-24 1 14 14 14 14 1
cookie5 2019-12-22 5 5 5 5 5 5
2.3 NTILE 函数
NTILE(n)
,用于将分组数据按照顺序切分成n片,返回当前切片值
注1:如果切片不均匀,默认增加第一个切片的分布
注2:NTILE
不支持 ROWS BETWEEN
举个栗子,找出购买总金额前20%的用户:
select b.user_name
from
(select
user_name,
ntile(5) over(order by sum(pay_amount) desc) as ntile_level
from user_sales_table
group by user_name ) b
where b.ntile_level = 1
2.4 分布函数 percent_rank(),cume_dist()
percent_rank()
:分组内当前行的RANK值-1 / 分组内总行数-1
SELECT cookieid,createtime,pv,
PERCENT_RANK() OVER(ORDER BY pv) AS rn1
from test1
cume_dist()
:返回分组内小于等于当前值的行数 / 分组内总行数
比如,可以利用cume_dist()
统计小于等于当前薪水的人数,所占总人数的比例:
SELECT department_id,employee_id,salary,
round(CUME_DIST() OVER(ORDER BY salary),2) AS cd1,
round(CUME_DIST() OVER(PARTITION BY department_id ORDER BY pv),2) AS cd2
FROM test1;
注:cd1没有partition,所有数据均为1组
2.5 LAG 和 LEAD 函数
LAG(col,n,DEFAULT)
:用于统计窗口内往上第n行值
LEAD(col,n,DEFAULT)
:用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01') OVER(PARTITION BY cookieid ORDER BY createtime) AS lag1,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS lag2
FROM test1;
2.6 FIRST_VALUE 和 LAST_VALUE 函数
FIRST_VALUE()
取分组内排序后,截止到当前行,第一个值
LAST_VALUE()
取分组内排序后,截止到当前行,最后一个值
SELECT cookieid,createtime,pv,
FIRST_VALUE(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS first
FROM test1;
想要取分组内排序后最后一个值,则需要变通一下:
将 order by
顺序颠倒再取 FIRST_VALUE()
参考链接:https://blog.csdn.net/dingchangxiu11/java/article/details/83145151
https://blog.csdn.net/Abysscarry/article/details/81408265