在实际开发中经常会有一些求连续类型指标,如系统连续几天登录、商城连续几天下单客户,等等关于连续时间类型的指标问题
需求:下面会做一个求连续三天下单的客户
1、测试数据
001 zs p001 2020-02-04
002 zs p005 2020-03-05
003 zs p006 2020-03-06
004 zs p002 2020-03-07
005 zs p001 2020-03-19
006 ls p002 2020-03-04
007 ls p001 2020-03-05
008 ww p002 2020-03-07
009 ww p001 2020-03-08
010 ww p002 2020-03-09
011 ww p001 2020-03-14
012 zl p002 2020-03-09
013 zl p001 2020-03-10
014 tq p002 2020-03-09
015 tq p001 2020-03-10
016 tq p002 2020-03-11
017 tq p001 2020-03-17
018 tq p002 2020-03-19
019 tq p001 2020-03-20
2、建表
hive (default)> create table t_order(order_id string,user_id string,product_id string,order_date string) row format delimited fields terminated by '\t';
3、导入数据
hive (default)> load data local inpath '/opt/module/hive/order.txt' into table t_order;
解决过程:
1、首先为相同客户按照购买时间排序
select order_id,user_id,product_id,order_date,
rank() over(partition by user_id order by order_date) st
from t_order;
2、日期列减去排序列,如果日期是连续的那么减去排序得到的日期是相同的,这就是数学中等差数列减等差数列,结果一直
select order_id,user_id,order_date,date_sub(order_date,st) js_date
from(
select order_id,user_id,product_id,order_date,rank() over(partition by user_id order by order_date) st from t_order
) t1;
结果:看如下数据
order_id user_id order_date js_date
006 ls 2020-03-04 2020-03-03
007 ls 2020-03-05 2020-03-03
014 tq 2020-03-09 2020-03-08
015 tq 2020-03-10 2020-03-08
016 tq 2020-03-11 2020-03-08
017 tq 2020-03-17 2020-03-13
018 tq 2020-03-19 2020-03-14
019 tq 2020-03-20 2020-03-14
008 ww 2020-03-07 2020-03-06
009 ww 2020-03-08 2020-03-06
010 ww 2020-03-09 2020-03-06
011 ww 2020-03-14 2020-03-10
012 zl 2020-03-09 2020-03-08
013 zl 2020-03-10 2020-03-08
001 zs 2020-02-04 2020-02-03
002 zs 2020-03-05 2020-03-03
003 zs 2020-03-06 2020-03-03
004 zs 2020-03-07 2020-03-03
005 zs 2020-03-19 2020-03-14
3、如上已经的出了等差数列相减后的日期,如果只需要求出连续三天购买的用户的使用分组,按照user_id,js_date分区大于等于3的就是满足条件的用户
select t2.user_id,count(*)
from(
select t1.order_id,t1.user_id,t1.order_date,date_sub(t1.order_date,st) js_date
from(
select t.order_id,t.user_id,t.product_id,t.order_date,rank() over(partition by t.user_id order by t.order_date) st from t_order t
) t1
) t2 group by t2.user_id,t2.js_date having count(*)>=3;
分组结果
t2.user_id _c1
tq 3
ww 3
zs 3
4、如果说不仅需要求连续三天购买的客户,还要求出具体购买明细,那么就不能用分组函数了,因为无法显示order_id,order_date这些信息
使用over 开窗函数来解决此问题
4.1 首先根据user_id,js_date 分区求和,计算连续日期购买的时间和
select t2.order_id,t2.user_id,t2.order_date,count(*) over(partition by t2.user_id,t2.js_date )
from(
select t1.order_id,t1.user_id,t1.order_date,date_sub(t1.order_date,st) js_date
from(
select t.order_id,t.user_id,t.product_id,t.order_date,rank() over(partition by t.user_id order by t.order_date) st from t_order t
) t1
) t2;
007 ls 2020-03-05 2
006 ls 2020-03-04 2
014 tq 2020-03-09 3
015 tq 2020-03-10 3
016 tq 2020-03-11 3
017 tq 2020-03-17 1
018 tq 2020-03-19 2
019 tq 2020-03-20 2
009 ww 2020-03-08 3
008 ww 2020-03-07 3
010 ww 2020-03-09 3
011 ww 2020-03-14 1
012 zl 2020-03-09 2
013 zl 2020-03-10 2
001 zs 2020-02-04 1
002 zs 2020-03-05 3
003 zs 2020-03-06 3
004 zs 2020-03-07 3
005 zs 2020-03-19 1
4.2 根据开窗计算结果,判断大于等于3 的就是连续三天购买过的客户明细数据
select t3.order_id,t3.user_id,t3.order_date
from(
select t2.order_id,t2.user_id,t2.order_date,count(*) over(partition by t2.user_id,t2.js_date) ct
from(
select t1.order_id,t1.user_id,t1.order_date,date_sub(t1.order_date,st) js_date
from(
select t.order_id,t.user_id,t.product_id,t.order_date,rank() over(partition by t.user_id order by t.order_date) st from t_order t
) t1
) t2
) t3 where t3.ct >=3 order by t3.user_id,t3.order_date;
结果如下:连续三天购买的客户订单明细
t3.order_id t3.user_id t3.order_date
014 tq 2020-03-09
015 tq 2020-03-10
016 tq 2020-03-11
008 ww 2020-03-07
009 ww 2020-03-08
010 ww 2020-03-09
002 zs 2020-03-05
003 zs 2020-03-06
004 zs 2020-03-07