hiveSql 百度面试题-连续签到领金币

文章详细分析了一道HiveSQL面试题,涉及用户每日签到获取金币的场景。用户签到数据按日期连续性分组,并使用窗口函数计算每个用户每日获取的金币数,包括基础金币和连续签到奖励。具体步骤包括构造标志字段、计算分组、分配行号以及应用签到规则计算金币。
摘要由CSDN通过智能技术生成

hiveSql 百度面试题-连续签到领金币

需求

用户在文章页可以每天签到,签到按照某种规则可获取金币,需统计每个用户每月获取金币数。

有用户签到明细表:

CREATE TABLE tmp_tb_user_log (
    uid INT COMMENT '用户ID',
    artical_id INT COMMENT '视频ID',
    in_time string COMMENT '进入时间',
    out_time string COMMENT '离开时间',
    sign_in int COMMENT '是否签到1是0否'
)

记录用户进入签到页,点击签到按钮后退出log数据。
部分表数据明细如下:

uidartical_idin_timeout_timesign_in
102022-07-07 10:00:002022-07-07 10:00:091
102022-07-08 10:00:002022-07-08 10:00:091
102022-07-09 10:00:002022-07-09 10:00:421
102022-07-10 10:00:002022-07-10 10:00:091
102022-07-11 23:59:552022-07-11 23:59:591
102022-07-12 10:00:282022-07-12 10:00:501
102022-07-13 10:00:282022-07-13 10:00:501
102022-07-14 11:00:282022-07-14 11:00:501
102022-07-15 11:59:282022-07-16 00:01:201

领取金币规则:

(1)只要用户签到就能获取一枚金币
(2)连续签到第三天和第七天分别可以在每天一枚的基础上额外再获取2枚和6金币
(3)连续签到7天后进行重置,按照规则(1)和(2)进行新一轮领币
(4)用户进入时间和离开时间如果跨天 按照进入时间计算打卡时间

上述举例明细中,用户id:1在2022-07-07到2022-07-15号连续签到,按照上述金币规则,每天的获取金币数据如下:

uidartical_idin_timeout_timesign_incoin
102022-07-07 10:00:002022-07-07 10:00:0911
102022-07-08 10:00:002022-07-08 10:00:0911
102022-07-09 10:00:002022-07-09 10:00:4213
102022-07-10 10:00:002022-07-10 10:00:0911
102022-07-11 23:59:552022-07-11 23:59:5911
102022-07-12 10:00:282022-07-12 10:00:5011
102022-07-13 10:00:282022-07-13 10:00:5017
102022-07-14 11:00:282022-07-14 11:00:5011
102022-07-15 11:59:282022-07-16 00:01:2011

用户id:1在2022-07-09是连续签到3天获得3金币,在2022-07-13连续签到7天,获得7金币。7月份共获得17金币。

分析

逆反思想思考,想要获取以上结果,得知道每个用户的有效签到连续日期分组,同一个用户可能有多个连续签到数据段,在每个数据段内需要row_number() 标行号,按照需求的金币规则计算每个连续签到数据段内的获取金币数。

  • 1.如果我们已经有如下数据:
uidartical_idin_timeout_timesign_inrank
102022-07-07 10:00:002022-07-07 10:00:0911
102022-07-08 10:00:002022-07-08 10:00:0912
102022-07-09 10:00:002022-07-09 10:00:4213
102022-07-10 10:00:002022-07-10 10:00:0914
102022-07-11 23:59:552022-07-11 23:59:5915
102022-07-12 10:00:282022-07-12 10:00:5016
102022-07-13 10:00:282022-07-13 10:00:5017
102022-07-14 11:00:282022-07-14 11:00:5018
102022-07-15 11:59:282022-07-16 00:01:2019

rank是每个用户每个连续签到数据段内按照in_time升序序号,如果有了这个序号,用这个序号对7取余后,处理余数为3和0(7天的整数,即每一个连续7天),其余都是1金币。就可以算出每个用户每天的获取金币数。

  • 2.如果想要得到步骤1的数据,前提是先将每个用户每次的联系签到日期分到同一组中,这里很容易想到利用重分组思想,在用户每次签到相对于前一天变化的日期数据处标记数据为1,连续签到数据标记为0,再sum() over() 按照签到时间累加,即将每个用户每次联系签到数据分到同一组。可见举例数据:

假设有数据如下:

uidartical_idin_timeout_timesign_in
102022-07-07 10:00:002022-07-07 10:00:091
102022-07-08 10:00:002022-07-08 10:00:091
102022-07-09 10:00:002022-07-09 10:00:421
102022-07-10 10:00:002022-07-10 10:00:090
102022-07-11 23:59:552022-07-11 23:59:591
102022-07-13 10:00:282022-07-13 10:00:501
102022-07-14 11:00:282022-07-14 11:00:501

用户在2022-07-10号进入签到页了 ,但是没有签到(sign_in为0),在2022-07-13直接没有来,即该用户应该有3段连续签到数据段。
分别是7号到9号,11号 和 13号到14号。分组应该是这样:

uidartical_idin_timeout_timesign_ingroup_name
102022-07-07 10:00:002022-07-07 10:00:0910
102022-07-08 10:00:002022-07-08 10:00:0910
102022-07-09 10:00:002022-07-09 10:00:4210
102022-07-10 10:00:002022-07-10 10:00:0901
102022-07-11 23:59:552022-07-11 23:59:5912
102022-07-13 10:00:282022-07-13 10:00:5013
102022-07-14 11:00:282022-07-14 11:00:5013

可以看到group_name字段一共有0到3 四个分组,但是组号1是没有签到的。如何实现这种分组?

  • 3.实现重分组,将每个用户的每段连续签到日期分到同一组。
    构造每次签到相对于前一天变化的日期数据标记为1,连续日期标记为0,再sum() over()开窗累加即可。可见数据:
uidartical_idin_timeout_timesign_insum_over
102022-07-07 10:00:002022-07-07 10:00:0910
102022-07-08 10:00:002022-07-08 10:00:0910
102022-07-09 10:00:002022-07-09 10:00:4210
102022-07-10 10:00:002022-07-10 10:00:0901
102022-07-11 23:59:552022-07-11 23:59:5911
102022-07-13 10:00:282022-07-13 10:00:5011
102022-07-14 11:00:282022-07-14 11:00:5010

sum_over字段:

  • 2022-07-10号相对于2022-07-09号是变化的(sign_in变为0),标记为1;

  • 2022-07-11号相对于2022-07-10号是变化的(sign_in变为1),标记为1;

  • 2022-07-13号相对于2022-07-11号是变化的(日期不连续),标记为1;
    有sum_over字段,sum(sum_over) over(partition by uid order by in_time) 即得到步骤1中的group_name。
    那构造sum_over需要注意什么?再往上一步想。

  • 4.这一步就是具体的分析什么是相对于前一条数据是否变化
    首先要明确,断签的条件有两个:
    1、没来,数据表现为没有当天数据
    2、来了没签到,数据表现为有当天数据,但是sign_in是0
    所以要标注这两个条件以明确步骤3中的sum_over到底应该是1还是0

相对于前一条数据,很容易想到开窗函数lead() over() 或者 lag() over()。不清楚这两个开窗函数用法的同学可以查看俺写的另一篇介绍开窗函数的文章开窗函数,很nice的。

if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag

flag_days:当前行减去上一行日期取天数差,如果差值大于1 ,标记为1,否则标记为0。这里理解为当前数据相对于上一条数据是连续日期为0,否则为1
sign_flag:当前行sign_in与上一行sign_in不一样则标记为1,否则标记为0。这里理解为从签到未签到 和 从未签到签到都被标记为1,其他标记为0。

有了上述flag_days 和 sign_flag。步骤3中的sum_over,很容易理解是当flag_days 和sign_flag都是0则sum_over标记为0,否则是1,怎么理解?
flag_days = 0 :当前行与上一行日期连续
sign_flag = 0 :当前行与上一行sign_in值没有变化(这里先不管是不是签到了,等于1,因为连续的没签到也是单独分为一组,不会和签到的分到同一组)
所以有变化的数据行(日期断连 或者 sign_in变化)被标记为1,其他为0。得到sum_over。

实现

1、准备数据


DROP TABLE IF EXISTS tmp_tb_user_log;
CREATE TABLE tmp_tb_user_log (
    uid INT COMMENT '用户ID',
    artical_id INT COMMENT '视频ID',
    in_time string COMMENT '进入时间',
    out_time string COMMENT '离开时间',
    sign_in int COMMENT '是否签到'
) lifecycle 3;
 
INSERT INTO tmp_tb_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES
  (1, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),
  (1, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),
  (1, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),
  (1, 0, '2022-07-10 10:00:00', '2022-07-10 10:00:09', 1),
  (1, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),
  (1, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 1),
  (1, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),
  (1, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),
  (1, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1),
  (2, 0, '2022-10-01 10:00:28', '2022-10-01 10:00:50', 1),
  (2, 0, '2022-10-02 10:00:01', '2022-10-02 10:01:50', 1),
  (2, 0, '2022-10-03 11:00:55', '2022-10-03 11:00:59', 1),
  (2, 0, '2022-10-04 11:00:45', '2022-10-04 11:00:55', 0),
  (2, 0, '2022-10-05 11:00:53', '2022-10-05 11:00:59', 1),
  (2, 0, '2022-10-06 11:00:45', '2022-10-06 11:00:55', 1),
  (3, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),
  (3, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),
  (3, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),
  (3, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),
  (3, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 1),
  (3, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),
  (3, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),
  (3, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1),
  (4, 0, '2022-07-07 10:00:00', '2022-07-07 10:00:09', 1),
  (4, 0, '2022-07-08 10:00:00', '2022-07-08 10:00:09', 1),
  (4, 0, '2022-07-09 10:00:00', '2022-07-09 10:00:42', 1),
  (4, 0, '2022-07-10 10:00:00', '2022-07-10 10:00:09', 1),
  (4, 0, '2022-07-11 23:59:55', '2022-07-11 23:59:59', 1),
  (4, 0, '2022-07-12 10:00:28', '2022-07-12 10:00:50', 0),
  (4, 0, '2022-07-13 10:00:28', '2022-07-13 10:00:50', 1),
  (4, 0, '2022-07-14 11:00:28', '2022-07-14 11:00:50', 1),
  (4, 0, '2022-07-15 11:59:28', '2022-07-16 00:01:20', 1);

select * from tmp_tb_user_log;

2、构造flag_days 和 sign_flag

select 
   *,
    if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
    if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
from tmp_tb_user_log

3、构造sum_over 和 group_name

select 
   t.uid,
    t.artical_id,
    t.in_time,
    t.out_time,
    t.sign_in,
    sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
from
    (select 
        *,
        if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
        if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
    from tmp_tb_user_log
    ) t

4、构造irank

select 
    t1.uid,
    t1.artical_id,
    t1.in_time,
    t1.out_time,
    t1.sign_in,
    t1.group_name,
    row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irank
from
    (select 
        t.uid,
        t.artical_id,
        t.in_time,
        t.out_time,
        t.sign_in,
        sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
    from
        (select 
            *,
            if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
            if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
        from tmp_tb_user_log
        ) t
    ) t1

5、处理金币获取规则

select 
    t2.uid,
    t2.artical_id,
    t2.in_time,
    t2.out_time,
    t2.sign_in,
    t2.group_name,
    t2.irank,
    case when t2.irank % 7 = 3 then 3 when t2.irank % 7 = 0 then 7 else 1 end as coin
from
    (select 
        t1.uid,
        t1.artical_id,
        t1.in_time,
        t1.out_time,
        t1.sign_in,
        t1.group_name,
        row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irank
    from
        (select 
            t.uid,
            t.artical_id,
            t.in_time,
            t.out_time,
            t.sign_in,
            sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
        from
            (select 
                *,
                if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
                if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
            from tmp_tb_user_log
            ) t
        ) t1
    ) t2

6、分组计算每个用户每个月获取金币数

select 
    t3.uid
    ,substr(t3.in_time,1,7) as pmonth
    ,sum(if(t3.sign_in = 1,coin,0)) as coins
from
    (select 
        t2.uid,
        t2.artical_id,
        t2.in_time,
        t2.out_time,
        t2.sign_in,
        t2.group_name,
        t2.irank,
        case when t2.irank % 7 = 3 then 3 when t2.irank % 7 = 0 then 7 else 1 end as coin
    from
        (select 
            t1.uid,
            t1.artical_id,
            t1.in_time,
            t1.out_time,
            t1.sign_in,
            t1.group_name,
            row_number() over(partition by t1.uid, t1.group_name order by t1.in_time) as irank
        from
            (select 
                t.uid,
                t.artical_id,
                t.in_time,
                t.out_time,
                t.sign_in,
                sum(if(flag_days = 0 and sign_flag = 0,0,1)) over(partition by uid order by in_time) as group_name
            from
                (select 
                    *,
                    if(datediff(cast(in_time as datetime), cast(lag(in_time,1,in_time) over(partition by uid order by in_time) as datetime),'dd') > 1, 1, 0) as flag_days,
                    if(lag(sign_in,1,sign_in) over(partition by uid order by in_time) != sign_in, 1, 0) as sign_flag
                from tmp_tb_user_log
                ) t
            ) t1
        ) t2
    ) t3
group by t3.uid,substr(t3.in_time,1,7);

最后

喜欢的点赞、关注、收藏吧~ 你的支持是最大的创作动力~~

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

@nanami

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值