HiveSQL高频窗口函数企业案例

课堂目标

  • 掌握sum(),avg()用于累计计算的窗口函数
  • 掌握row_number(),rank()用于排序的窗口函数【最高频※※※】
  • 掌握ntile()用于分组查询的窗口函数【很低频】
  • 掌握lag(),lead()偏移分析窗口函数【很低频】


1.2018年每月的支付总额和当年累计支付总额
'''sum后面加需要累计求和的字段
over后面是按照月份的升序进行的【排到某月份前面的所有都进行汇总】
'''
select a.month,
a.pay_amount,
sum(a.pay_amount) over(order by a.month)
from
(select month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt)='2018'
group by month(dt)) a;

image.png

2.2017-2018年每月的支付总额和当年累计支付总额
  • 先找出每年每月的支付金额
  • 是按年分组的前提下进行月份升序求和的
    partition by 起到分组的作用
    order by 按照什么顺序进行累加,默认升序,降序desc
select a.year,
a.month,
a.pay_amount,
sum(a.pay_amount) over(partition by a.year order by a.month)
from
(select year(dt) year,
month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt) in (2017,2018)
group by year(dt),month(dt)) a;

image.png

3.2018年每个月的近三月移动平均支付金额

移动范围:rows between 2 preceding and current row
image.png

select a.month,
a.pay_amount,
avg(a.pay_amount) over(order by a.month rows between 2 preceding and current row)
from 
(select month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt)='2018'
group by month(dt)) a;

image.png

方法总结

image.png

拓展:
max(…) over(parttition by …order by …rows between …and…)
min(…) over(parttition by …order by …rows between …and…)



row_number() over(…)
rank() over(…)
dense_rand() over(…)

1.2019年1月,用户购买商品品类数量的排名
select user_name,
count(distinct goods_category),
row_number() over(order by count(distinct goods_category)),
rank() over(order by count(distinct goods_category)),
dense_rank() over(order by count(distinct goods_category))
from user_trade
where substr(dt,1,7)='2019-01'
group by user_name;

image.png
row_number:名额是固定的,挑选一定数量的人
rank:成绩排名
dense_number:三种类型的奖牌,同样分数获得

2.选出2019年支付金额排名在第 10、20、30 名的用户
select a.user_name,
a.pay_amount,
a.rank
from
(select user_name,
sum(pay_amount) pay_amount,
rank() over(order by sum(pay_amount) desc) rank
from user_trade
where year(dt)='2019'
group by user_name) a
where a.rank in (10,20,30);

image.png



分组窗口函数
ntile(n) over(partition by … order by …)

1.将2019年1月的支付用户,按照支付金额分成5组
select user_name,
sum(pay_amount) pay_amount,
ntile(5) over(order by sum(pay_amount) desc)
from user_trade
where dt between'2019-01-01' and '2019-01-31'
group by user_name;

'''
dt between'2019-01-01' and '2019-01-31'
等价于
substr(dt,1,7)='2019-01'
'''

image.png

2.选出2019年退款金额排名前10%的用户
select a.user_name,
a.refund_amount,
a.rank
from
(select user_name,
sum(refund_amount) refund_amount,
ntile(10) over(order by sum(refund_amount) desc) rank 
from user_refund
where year(dt)='2019'
group by user_name) a
where a.rank=1;

image.png



偏移分析窗口函数

  • 这一次和上一次,今天和明天,这个月和上个月等
    lag(exp_str,offset,defval) over(partition by … order by …)【向前】
    lead(exp_str,offset,defval) over(partition by … order by …)【向后】
    exp_str:字段名称
    offset:偏移量(默认值是1)
    defval:默认值,超出范围
1.Alice和Alexander的各种时间偏移量

【场景】购买用户购买的时间间隔分布

select user_name,
dt,
lag(dt,1,dt) over(partition by user_name order by dt),
lag(dt) over(partition by user_name order by dt),
lag(dt,2,dt) over(partition by user_name order by dt),
lag(dt,2) over(partition by user_name order by dt)
from user_trade
where dt>'0'
and user_name in ('Alice','Alexander');

image.png

2.支付时间间隔超过100天的用户数【※】
select count(distinct user_name)
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt
from user_trade
where dt>'0') a
where a.lead_dt is not null
and datediff(a.lead_dt,a.dt)>100;
'''结果:180'''


重点练习

1.每个城市,不同性别,2018年支付金额最高的TOP3用户

【user_trade和user_info表】

'''
1.每个用户的支付金额汇总
2.每个用户的城市和性别(直接左连接user_info表)
3.每个城市不同性别的用户的支付金额排名
4.找出TOP3
'''
select c.city,
c.sex,
c.user_name,
c.pay_amount,
c.rank
from 
(select b.city,
b.sex,
a.user_name,
a.pay_amount,
row_number() over(partition by b.city,b.sex order by a.pay_amount desc) rank
from
(select user_name,
sum(pay_amount) pay_amount
from user_trade
where year(dt)='2018'
group by user_name) a
left join user_info b
on a.user_name=b.user_name
-- 这里不要 group by 了,因为在排序中已经指定了
) c
where c.rank<=3;

image.png

2.2.每个手机品牌退款金额前25%的用户【desc千万不要忘记】【user_refund和user_info表】

ntile(4) over()

'''
自己做错原因:
1.将user_info 表左连接user_trade表
而应该user_trade表左连接user_info 表
2.可以直接 left join user_info b 表示可以去全部字段
a.user_name=b.user_name
而当left join (select ... from ...) b 时,就一定要写select user_name,否则就不能
a.user_name=b.user_name
3.get_json_object(extra1,'$.phonebrand') phonebrand
等价于extra2['phonebrand'] phonebrand

'''
'''
1.每个用户的手机品牌
2.每个用户的退款金额
'''
select c.user_name,
c.phonebrand,
c.refund_amount,
c.level
from
(select a.user_name,
b.phonebrand,
a.refund_amount,
ntile(4) over(partition by b.phonebrand order by a.refund_amount desc) level
from
(select user_name,
sum(refund_amount) refund_amount
from user_refund
group by user_name) a
left join 
(select user_name,
get_json_object(extra1,'$.phonebrand') phonebrand
from user_info) b
on a.user_name=b.user_name) c
where c.level=1;

image.png

3.计算每12个月的用户累计支付金额
select a.month,
a.pay_amount,
sum(a.pay_amount) over(order by a.month rows between 11 preceding and current row)
from
(select substr(dt,1,7) month,
sum(pay_amount) pay_amount
from user_trade
where dt>'0'
group by substr(dt,1,7)) a;

image.png

4.计算出每4个月的最大退款金额
select a.month,
a.refund_amount,
max(a.refund_amount) over(order by a.month rows between 3 preceding and current row)
from
(select substr(dt,1,7) month,
sum(refund_amount) refund_amount
from user_refund
where dt>'0'
group by substr(dt,1,7)) a;

image.png

5.退款时间间隔最长的用户
  • 用户的退款时间间隔(天),然后order by 就好,真搞不懂当初为啥要用max()

select a.user_name,
datediff(a.lead_dt,a.dt) date_diff
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt 
from user_refund
where dt>'0') a
where a.lead_dt is not null;

image.png

  • 退款时间间隔最长的用户【错误】:要用max()必须要用 group by
select a.user_name,
max(a.date_diff)
from
(select a.user_name,
datediff(a.lead_dt,a.dt) date_diff
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt 
from user_refund
where dt>'0') a
where a.lead_dt is not null) a;
  • 正确写法:
select b.user_name,
b.date_diff
from
(select a.user_name,
datediff(a.lead_dt,a.dt) date_diff
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt 
from user_refund
where dt>'0') a
where a.lead_dt is not null) b
order by b.date_diff desc limit 1;

image.png

  • 1
    点赞
  • 9
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值