需求
用户在文章页可以每天签到,签到按照某种规则可获取金币,需统计每个用户每月获取金币数。
有用户签到明细表:
CREATE TABLE tmp_tb_user_log (
uid INT COMMENT '用户ID',
artical_id INT COMMENT '视频ID',
in_time string COMMENT '进入时间',
out_time string COMMENT '离开时间',
sign_in int COMMENT '是否签到1是0否'
)
记录用户进入签到页,点击签到按钮后退出log数据。
部分表数据明细如下:
uid | artical_id | in_time | out_time | sign_in |
---|---|---|---|---|
1 | 0 | 2022-07-07 10:00:00 | 2022-07-07 10:00:09 | 1 |
1 | 0 | 2022-07-08 10:00:00 | 2022-07-08 10:00:09 | 1 |
1 | 0 | 2022-07-09 10:00:00 | 2022-07-09 10:00:42 | 1 |
1 | 0 | 2022-07-10 10:00:00 | 2022-07-10 10:00:09 | 1 |
1 | 0 | 2022-07-11 23:59:55 | 2022-07-11 23:59:59 | 1 |
1 | 0 | 2022-07-12 10:00:28 | 2022-07-12 10:00:50 | 1 |
1 | 0 | 2022-07-13 10:00:28 | 2022-07-13 10:00:50 | 1 |
1 | 0 | 2022-07-14 11:00:28 | 2022-07-14 11:00:50 | 1 |
1 | 0 | 2022-07-15 11:59:28 | 2022-07-16 00:01:20 | 1 |
领取金币规则:
(1)只要用户签到就能获取一枚金币
(2)连续签到第三天和第七天分别可以在每天一枚的基础上额外再获取2枚和6金币
(3)连续签到7天后进行重置,按照规则(1)和(2)进行新一轮领币
(4)用户进入时间和离开时间如果跨天 按照进入时间计算打卡时间
上述举例明细中,用户id:1在2022-07-07到2022-07-15号连续签到,按照上述金币规则,每天的获取金币数据如下:
uid | artical_id | in_time | out_time | sign_in | coin |
---|---|---|---|---|---|
1 | 0 | 2022-07-07 10:00:00 | 2022-07-07 10:00:09 | 1 | 1 |
1 | 0 | 2022-07-08 10:00:00 | 2022-07-08 10:00:09 | 1 | 1 |
1 | 0 | 2022-07-09 10:00:00 | 2022-07-09 10:00:42 | 1 | 3 |
1 | 0 | 2022-07-10 10:00:00 | 2022-07-10 10:00:09 | 1 | 1 |
1 | 0 | 2022-07-11 23:59:55 | 2022-07-11 23:59:59 | 1 | 1 |
1 | 0 | 2022-07-12 10:00:28 | 2022-07-12 10:00:50 | 1 | 1 |
1 | 0 | 2022-07-13 10:00:28 | 2022-07-13 10:00:50 | 1 | 7 |
1 | 0 | 2022-07-14 11:00:28 | 2022-07-14 11:00:50 | 1 | 1 |
1 | 0 | 2022-07-15 11:59:28 | 2022-07-16 00:01:20 | 1 | 1 |
用户id:1在2022-07-09是连续签到3天获得3金币,在2022-07-13连续签到7天,获得7金币。7月份共获得17金币。
分析
逆反思想思考,想要获取以上结果,得知道每个用户的有效签到连续日期分组,同一个用户可能有多个连续签到数据段,在每个数据段内需要row_number() 标行号,按照需求的金币规则计算每个连续签到数据段内的获取金币数。
- 1.如果我们已经有如下数据:
uid | artical_id | in_time | out_time | sign_in | rank |
---|---|---|---|---|---|
1 | 0 | 2022-07-07 10:00:00 | 2022-07-07 10:00:09 | 1 | 1 |
1 | 0 | 2022-07-08 10:00:00 | 2022-07-08 10:00:09 | 1 | 2 |
1 | 0 | 2022-07-09 10:00:00 | 2022-07-09 10:00:42 | 1 | 3 |
1 | 0 | 2022-07-10 10:00:00 | 2022-07-10 10:00:09 | 1 | 4 |
1 | 0 | 2022-07-11 23:59:55 | 2022-07-11 23:59:59 | 1 | 5 |
1 | 0 | 2022-07-12 10:00:28 | 2022-07-12 10:00:50 | 1 | 6 |
1 | 0 | 2022-07-13 10:00:28 | 2022-07-13 10:00:50 | 1 | 7 |
1 | 0 | 2022-07-14 11:00:28 | 2022-07-14 11:00:50 | 1 | 8 |
1 | 0 | 2022-07-15 11:59:28 | 2022-07-16 00:01:20 | 1 | 9 |
rank是每个用户每个连续签到数据段内按照in_time升序序号,如果有了这个序号,用这个序号对7取余后,处理余数为3和0(7天的整数,即每一个连续7天),其余都是1金币。就可以算出每个用户每天的获取金币数。
- 2.如果想要得到步骤1的数据,前提是先将每个用户每次的联系签到日期分到同一组中,这里很容易想到利用重分组思想,在用户每次签到相对于前一天变化的日期数据处标记数据为1,连续签到数据标记为0,再sum() over() 按照签到时间累加,即将每个用户每次联系签到数据分到同一组。可见举例数据:
假设有数据如下:
uid | artical_id | in_time | out_time | sign_in |
---|---|---|---|---|
1 | 0 | 2022-07-07 10:00:00 | 2022-07-07 10:00:09 | 1 |
1 | 0 | 2022-07-08 10:00:00 | 2022-07-08 10:00:09 | 1 |
1 | 0 | 2022-07-09 10:00:00 | 2022-07-09 10:00:42 | 1 |
1 | 0 | 2022-07-10 10:00:00 | 2022-07-10 10:00:09 | 0 |
1 | 0 | 2022-07-11 23:59:55 | 2022-07-11 23:59:59 | 1 |
1 | 0 | 2022-07-13 10:00:28 | 2022-07-13 10:00:50 | 1 |
1 | 0 | 2022-07-14 11:00:28 | 2022-07-14 11:00:50 | 1 |
用户在2022-07-10号进入签到页了 ,但是没有签到(sign_in为0),在2022-07-13直接没有来,即该用户应该有3段连续签到数据段。
分别是7号到9号,11号 和 13号到14号。分组应该是这样:
uid | artical_id | in_time | out_time | sign_in | group_name |
---|---|---|---|---|---|
1 | 0 | 2022-07-07 10:00:00 | 2022-07-07 10:00:09 | 1 | 0 |
1 | 0 | 2022-07-08 10:00:00 | 2022-07-08 10:00:09 | 1 | 0 |
1 | 0 | 2022-07-09 10:00:00 | 2022-07-09 10:00:42 | 1 | 0 |
1 | 0 | 2022-07-10 10:00:00 | 2022-07-10 10:00:09 | 0 | 1 |
1 | 0 | 2022-07-11 23:59:55 | 2022-07-11 23:59:59 | 1 | 2 |
1 | 0 | 2022-07-13 10:00:28 | 2022-07-13 10:00:50 | 1 | 3 |
1 | 0 | 2022-07-14 11:00:28 | 2022-07-14 11:00:50 | 1 | 3 |
可以看到group_name字段一共有0到3 四个分组,但是组号1是没有签到的。如何实现这种分组?
- 3.实现重分组,将每个用户的每段连续签到日期分到同一组。
构造每次签到相对于前一天变化的日期数据标记为1,连续日期标记为0,再sum() over()开窗累加即可。可见数据:
uid | artical_id | in_time | out_time | sign_in | sum_over |
---|---|---|---|---|---|
1 | 0 | 2022-07-07 10:00:00 | 2022-07-07 10:00:09 | 1 | 0 |
1 | 0 | 2022-07-08 10:00:00 | 2022-07-08 10:00:09 | 1 | 0 |
1 | 0 | 2022-07-09 10:00:00 | 2022-07-09 10:00:42 | 1 | 0 |
1 | 0 | 2022-07-10 10:00:00 | 2022-07-10 10:00:09 | 0 | 1 |
1 | 0 | 2022-07-11 23:59:55 | 2022-07-11 23:59:59 | 1 | 1 |
1 | 0 | 2022-07-13 10:00:28 | 2022-07-13 10:00:50 | 1 | 1 |
1 | 0 | 2022-07-14 11:00:28 | 2022-07-14 11:00:50 | 1 | 0 |
sum_over字段:
-
2022-07-10号相对于2022-07-09号是变化的(sign_in变为0),标记为1;
-
2022-07-11号相对于2022-07-10号是变化的(sign_in变为1),标记为1;
-
2022-07-13号相对于2022-07-11号是变化的(日期不连续),标记为1;
有sum_over字段,sum(sum_over) over(partition by uid order by in_time) 即得到步骤1中的group_name。
那构造sum_over需要注意什么?再往上一步想。 -
4.这一步就是具体的分析什么是相对于前一条数据是否变化
首先要明确,断签的条件有两个:
1、没来,数据表现为没有当天数据
2、来了没签到,数据表现为有当天数据,但是sign_in是0
所以要标注这两个条件以明确步骤3中的sum_over到底应该是1还是0
相对于前一条数据,很容易想到开窗函数lead() over() 或者 lag() over()。不清楚这两个开窗函数用法的同学可以查看俺写的另一篇介绍开窗函数的文章开窗函数,很nice的。
if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
flag_days:当前行减去上一行日期取天数差,如果差值大于1 ,标记为1,否则标记为0。这里理解为当前数据相对于上一条数据是连续日期为0,否则为1
sign_flag:当前行sign_in与上一行sign_in不一样则标记为1,否则标记为0。这里理解为从签到到未签到 和 从未签到到签到都被标记为1,其他标记为0。
有了上述flag_days 和 sign_flag。步骤3中的sum_over,很容易理解是当flag_days 和sign_flag都是0则sum_over标记为0,否则是1,怎么理解?
flag_days = 0 :当前行与上一行日期连续
sign_flag = 0 :当前行与上一行sign_in值没有变化(这里先不管是不是签到了,等于1,因为连续的没签到也是单独分为一组,不会和签到的分到同一组)
所以有变化的数据行(日期断连 或者 sign_in变化)被标记为1,其他为0。得到sum_over。
实现
1、准备数据
DROP TABLE IF EXISTS tmp_tb_user_log;
CREATE TABLE tmp_tb_user_log (
uid INT COMMENT '用户ID',
artical_id INT COMMENT '视频ID',
in_time string COMMENT '进入时间',
out_time string COMMENT '离开时间',
sign_in int COMMENT '是否签到'
) lifecycle 3;
INSERT INTO tmp_tb_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES
(1, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),
(1, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),
(1, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),
(1, 0, '2022-07-10 10:00:00', '2022-07-10 10:00:09', 1),
(1, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),
(1, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 1),
(1, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),
(1, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),
(1, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1),
(2, 0, '2022-10-01 10:00:28', '2022-10-01 10:00:50', 1),
(2, 0, '2022-10-02 10:00:01', '2022-10-02 10:01:50', 1),
(2, 0, '2022-10-03 11:00:55', '2022-10-03 11:00:59', 1),
(2, 0, '2022-10-04 11:00:45', '2022-10-04 11:00:55', 0),
(2, 0, '2022-10-05 11:00:53', '2022-10-05 11:00:59', 1),
(2, 0, '2022-10-06 11:00:45', '2022-10-06 11:00:55', 1),
(3, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),
(3, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),
(3, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),
(3, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),
(3, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 1),
(3, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),
(3, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),
(3, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1),
(4, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),
(4, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),
(4, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),
(4, 0, '2022-07-10 10:00:00', '2022-07-10 10:00:09', 1),
(4, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),
(4, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 0),
(4, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),
(4, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),
(4, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1);
select * from tmp_tb_user_log;
2、构造flag_days 和 sign_flag
select
*,
if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
from tmp_tb_user_log
3、构造sum_over 和 group_name
select
t.uid,
t.artical_id,
t.in_time,
t.out_time,
t.sign_in,
sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
from
(select
*,
if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
from tmp_tb_user_log
) t
4、构造irank
select
t1.uid,
t1.artical_id,
t1.in_time,
t1.out_time,
t1.sign_in,
t1.group_name,
row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irank
from
(select
t.uid,
t.artical_id,
t.in_time,
t.out_time,
t.sign_in,
sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
from
(select
*,
if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
from tmp_tb_user_log
) t
) t1
5、处理金币获取规则
select
t2.uid,
t2.artical_id,
t2.in_time,
t2.out_time,
t2.sign_in,
t2.group_name,
t2.irank,
case when t2.irank % 7 = 3 then 3 when t2.irank % 7 = 0 then 7 else 1 end as coin
from
(select
t1.uid,
t1.artical_id,
t1.in_time,
t1.out_time,
t1.sign_in,
t1.group_name,
row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irank
from
(select
t.uid,
t.artical_id,
t.in_time,
t.out_time,
t.sign_in,
sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
from
(select
*,
if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
from tmp_tb_user_log
) t
) t1
) t2
6、分组计算每个用户每个月获取金币数
select
t3.uid
,substr(t3.in_time,1,7) as pmonth
,sum(if(t3.sign_in = 1,coin,0)) as coins
from
(select
t2.uid,
t2.artical_id,
t2.in_time,
t2.out_time,
t2.sign_in,
t2.group_name,
t2.irank,
case when t2.irank % 7 = 3 then 3 when t2.irank % 7 = 0 then 7 else 1 end as coin
from
(select
t1.uid,
t1.artical_id,
t1.in_time,
t1.out_time,
t1.sign_in,
t1.group_name,
row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irank
from
(select
t.uid,
t.artical_id,
t.in_time,
t.out_time,
t.sign_in,
sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
from
(select
*,
if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
from tmp_tb_user_log
) t
) t1
) t2
) t3
group by t3.uid,substr(t3.in_time,1,7);
最后
喜欢的点赞、关注、收藏吧~ 你的支持是最大的创作动力~~