SQL-求用户连续购买天数以及与上次购买间隔天数
Step 1:对数据进行聚合/去重
根据需求不同,可能的名称也不尽相同,主要是理解思想.
第一步是对数据按天进行初步聚合(因为一个用户可能在某天有多次购买行为)
因为只是作为演示,所以只用单个用户进行
select member_id,order_date from TABLE_NAME where member_id='1690' group by member_id,order_date
结果如下:
member_id | order_date |
---|---|
1690 | 2021-01-01 |
1690 | 2021-01-02 |
1690 | 2021-01-03 |
1690 | 2021-01-04 |
1690 | 2021-01-06 |
1690 | 2021-01-07 |
1690 | 2021-01-08 |
1690 | 2021-01-09 |
1690 | 2021-01-10 |
1690 | 2021-01-11 |
1690 | 2021-01-12 |
1690 | 2021-01-13 |
1690 | 2021-01-15 |
结果就不展示完全了…
Step 2:对用户进行分组排序
我们把上一个结果表称为: TABLE_1
select
member_id
,order_date
,row_number() over(partition by member_id order by order_date) as rank_num
from
TABLE_1
结果如下:
member_id | order_date | rank_num |
---|---|---|
1690 | 2021-01-01 | 1 |
1690 | 2021-01-02 | 2 |
1690 | 2021-01-03 | 3 |
1690 | 2021-01-04 | 4 |
1690 | 2021-01-06 | 5 |
1690 | 2021-01-07 | 6 |
1690 | 2021-01-08 | 7 |
1690 | 2021-01-09 | 8 |
1690 | 2021-01-10 | 9 |
1690 | 2021-01-11 | 10 |
1690 | 2021-01-12 | 11 |
1690 | 2021-01-13 | 12 |
1690 | 2021-01-15 | 13 |
Step 3:日期与序号进行减法运算
思路: 如果我的日期是连续的.这时日期减去序号天数,那么连续的日期会得到同一个日期,我们暂且称之为起始日期.
我们把上一个结果表称为: TABLE_2
select
member_id
,order_date
,rank_num
,date_sub(order_date,rank_num) as origin_date
from
TABLE_2
结果如下:
member_id | order_date | rank_num | origin_date |
---|---|---|---|
1690 | 2021-01-01 | 1 | 2020-12-31 |
1690 | 2021-01-02 | 2 | 2020-12-31 |
1690 | 2021-01-03 | 3 | 2020-12-31 |
1690 | 2021-01-04 | 4 | 2020-12-31 |
1690 | 2021-01-06 | 5 | 2021-01-01 |
1690 | 2021-01-07 | 6 | 2021-01-01 |
1690 | 2021-01-08 | 7 | 2021-01-01 |
1690 | 2021-01-09 | 8 | 2021-01-01 |
1690 | 2021-01-10 | 9 | 2021-01-01 |
1690 | 2021-01-11 | 10 | 2021-01-01 |
1690 | 2021-01-12 | 11 | 2021-01-01 |
1690 | 2021-01-13 | 12 | 2021-01-01 |
1690 | 2021-01-15 | 13 | 2021-01-02 |
Step 4:获取开始连续的日期以及连续天数
其实从上面的结果表中就已经能够统计出那些用户的连续天数满足需求了.为了保留更多信息,所以还是可以继续进行加工
我们把上一个结果表称为: TABLE_3
select
table_a.member_id as member_id
,table_a.continuous_start_date as continuous_start_date
,count(table_a.continuous_start_date) as continuous_date
from
(
select
member_id
,min(order_date) over(partition by member_id,origin_date) as continuous_start_date
from
TABLE_3
) table_a
group by table_a.member_id,table_a.continuous_start_date
结果如下:
member_id | continuous_start_date | continuous_date |
---|---|---|
1690 | 2021-01-01 | 4 |
1690 | 2021-01-06 | 8 |
1690 | 2021-01-15 | 9 |
1690 | 2021-01-25 | 6 |
1690 | 2021-02-05 | 1 |
1690 | 2021-02-07 | 1 |
1690 | 2021-02-09 | 3 |
Step 5:使用Hive中lead函数
lead() over() 将下一行提到当前行来另起字段
lead(需要调整的字段名,下几行,空余部分用什么补充)
lag() over() 将上一行提到当前行来另起字段
lag(需要调整的字段名,上几行,空余部分用什么补充)
饱经沙场的都知道,产品的提数需求后面一般都会跟着无数的顺便、再来等等…
思路: 当 (continuous_start_date + continuous_date) - 下一行的continuous_start_date = 中断天数
其实算中断天数的方法很多.我这里就采用 lead函数来实现.我们把上面的结果表称为: TABLE_4
select
a.member_id as member_id
,a.continuous_start_date as continuous_start_date
,a.continuous_next_date as continuous_next_date
,a.continuous_date as continuous_date
,datediff(a.continuous_next_date,date_add(a.continuous_start_date,a.continuous_date)) as discontinuity_date
from
(
select
member_id as member_id
,continuous_start_date as continuous_start_date
,lead(continuous_start_date,1,0) over(partition by member_id,continuous_start_date) as continuous_next_date
,continuous_date as continuous_date
from
TABLE_4
)a
结果如下:
member_id | continuous_start_date | continuous_next_date | continuous_date | discontinuity_date |
---|---|---|---|---|
1690 | 2021-01-01 | 2021-01-06 | 4 | 1 |
1690 | 2021-01-06 | 2021-01-15 | 8 | 1 |
1690 | 2021-01-15 | 2021-01-25 | 9 | 1 |
1690 | 2021-01-25 | 2021-02-05 | 6 | 5 |
1690 | 2021-02-05 | 2021-02-07 | 1 | 1 |
1690 | 2021-02-07 | 2021-02-09 | 1 | 1 |
1690 | 2021-02-09 | 2021-02-13 | 3 | 1 |