开窗——聚合函数与分析函数
应用场景
(1)排序
(2)动态分组group by
(3)top n
(4)累计计算
(5)层次查询
窗口函数
first_value(col):取分组内排序后,截止到当前行,第一个值
last_value(col): 取分组内排序后,截止到当前行,最后一个值
lead(col,n,default) :用于统计窗口内往下第n行值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为null时候,取默认值,如不指定,则为null)
lag(col,n,default) :用于统计窗口内往上第n行值。第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为null时候,取默认值,如不指定,则为null)
over从句
1、over前使用标准的聚合函数count、sum、min、max、avg,也可以使用其他分析函数row_number()、rank()、dense_rank()、cume_dist、percent_rank、ntile(n)
2、over后括号内使用partition by语句进行分组,可以是一列或多列
3、over后括号内使用order by语句进行排序,可以是一列或多列
4、综合使用partition by与order by语句先后组再排序,可以是一列或多列,数据为分组内排序后截止当前行之前计算内容
窗口规范
开窗——聚合函数
count、sum、min、max、avg、collect_set、collect_list
with tmp as
(
select 'd001' as order_id,123 as user_id,20 as amt,'2020-01-01 00:23:33' as pay_time
union all select 'd002' as order_id,123 as user_id,25 as amt,'2020-01-02 00:23:33' as pay_time
union all select 'd005' as order_id,123 as user_id,250 as amt,'2020-01-04 00:23:33' as pay_time
union all select 'd007' as order_id,123 as user_id,100 as amt,'2020-01-06 00:23:33' as pay_time
union all select 'd003' as order_id,125 as user_id,15 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd004' as order_id,125 as user_id,39 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd006' as order_id,125 as user_id,25 as amt,'2020-01-05 00:23:33' as pay_time
union all select 'd008' as order_id,125 as user_id,359 as amt,'2020-01-07 00:23:33' as pay_time
)
select user_id,order_id,amt,pay_time
,count(*)over(partition by user_id) as `当前用户订单数`
,count(distinct user_id)over(partition by 1) as `当前用户数`
,sum(amt)over(partition by user_id) as `当前用户总金额`
,sum(amt)over(order by 1) as `总金额`
,sum(amt)over(partition by user_id order by pay_time) as `当前用户截止当前支付时间累计总金额`
,max(amt)over(partition by user_id) as `当前用户最大金额`
,max(amt)over(order by pay_time) as `截止当前成单时间最大成交金额`
,min(amt)over(partition by user_id) as `当前用户最小金额`
,avg(amt)over(partition by user_id) as `当前用户订单均额`
from tmp
user_id | order_id | amt | pay_time | 当前用户订单数 | 当前用户数 | 当前用户总金额 | 总金额 | 当前用户截止当前支付时间累计总金额 | 当前用户最大金额 | 截止当前成单时间最大成交金额 | 当前用户最小金额 | 当前用户订单均额 |
123 | d001 | 20 | 2020-01-01 00:23:33 | 4 | 2 | 395 | 833 | 20 | 250 | 20 | 20 | 98.75 |
123 | d002 | 25 | 2020-01-02 00:23:33 | 4 | 2 | 395 | 833 | 45 | 250 | 25 | 20 | 98.75 |
125 | d004 | 39 | 2020-01-03 00:23:33 | 4 | 2 | 438 | 833 | 54 | 359 | 39 | 15 | 109.5 |
125 | d003 | 15 | 2020-01-03 00:23:33 | 4 | 2 | 438 | 833 | 54 | 359 | 39 | 15 | 109.5 |
123 | d005 | 250 | 2020-01-04 00:23:33 | 4 | 2 | 395 | 833 | 295 | 250 | 250 | 20 | 98.75 |
125 | d006 | 25 | 2020-01-05 00:23:33 | 4 | 2 | 438 | 833 | 79 | 359 | 250 | 15 | 109.5 |
123 | d007 | 100 | 2020-01-06 00:23:33 | 4 | 2 | 395 | 833 | 395 | 250 | 250 | 20 | 98.75 |
125 | d008 | 359 | 2020-01-07 00:23:33 | 4 | 2 | 438 | 833 | 438 | 359 | 359 | 15 | 109.5 |
开窗——分析函数
row_number() 从1开始,顺序生成序列号,如按成绩排名,也可以用于获取某学科成绩第一名,或获取某学科成绩第一名的学员姓名或其他信息等。
row_number() over(partition by col1 order by col2) as rowid
结果:1,2,3,4
rank() 并列排序,即下一个数字会跳过
rank() over(partition by col1 order by col2) as rowid
结果:1,2,2,4,5
dense_rank() 并列排序,下一个数据不会跳过
dense_rank() over(partition by col1 order by col2) as rowid
结果:1,2,2,3,4
ntile(n) 用于将分组数据按照顺序切分成n片,返回当前切片值,如果切片不均匀,默认增加第一个切片的分布。(https://blog.csdn.net/weixin_36190755/article/details/109056534)
ntile不支持rows between,比如 ntile(2) over(partition by col1 order by col2 rows between 3 preceding and current row)。
first_value(col) 当前分组内排序后col第一个值
last_value(col) 当前分组内排序后col最后一个值
lag(col1,n,default) over(partition by col1 order by col2) as up 向前取第n行
lead(col1,n,default) over(partition by col1 order by col2) as down 向后取第n行
cume_dist() 小于等于当前值的行数/分组内总行数。比如,统计小于等于当前薪水的人数,所占总人数的比例
percent_rank() 分组内当前行的rank值-1/分组内总行数-1