需求解析
今天写hive的SQL题目,遇到了一个有趣的题目,题目的需求如下:
用户每天签到可以领 1 金币,并可以累计签到天数,连续签到的第 3、7 天分别可以额外领 2 和 6 金币。 每连续签到 7 天重新累积签到天数。 计算用户从 2021 年 7 月以来每个月获得的金币数,结果按照月份、ID 升序排序。
现在来分析一下这个需求,通过转换可以得知:用户每登录一天就可以获得一个金币,连续登录的第三天可以获得三个金币,连续登录的第七天可以获得七个金币。
那么现在问题来到了怎么进行7天一统计呢?
有个思路,一共是七天,给每天增加一个标记,第三天是3,第七天是7,其余的全是1。
那么怎么给第三天和第七天打标记呢?
现在来找规律:
第三天登录:3,10,17,24 .......
第七天登录:7,14,21,28......
date-3=0 or (date-3) % 7=0
date % 7 = 0
那么好,现在上代码
DROP TABLE IF EXISTS user_sign;
CREATE TABLE user_sign
(
`user_id` varchar(32),
`id` varchar(32),
`start_time` timestamp,
`end_time` timestamp,
`sign` int
)
COMMENT '用户签到表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
NULL DEFINED AS ''
LOCATION '/warehouse/sdc/rds/user_sign';
INSERT
INTO user_sign
VALUES ('101', '0', CAST('2021-07-07 02:00:00' AS TIMESTAMP), CAST('2021-07-07 02:00:09' AS TIMESTAMP), 1),
('101', '0', CAST('2021-07-08 02:00:00' AS TIMESTAMP), CAST('2021-07-08 02:00:09' AS TIMESTAMP), 1),
('101', '0', CAST('2021-07-09 02:00:00' AS TIMESTAMP), CAST('2021-07-09 02:00:42' AS TIMESTAMP), 1),
('101', '0', CAST('2021-07-10 02:00:00' AS TIMESTAMP), CAST('2021-07-10 02:00:09' AS TIMESTAMP), 1),
('101', '0', CAST('2021-07-11 15:59:55' AS TIMESTAMP), CAST('2021-07-11 15:59:59' AS TIMESTAMP), 1),
('101', '0', CAST('2021-07-12 02:00:28' AS TIMESTAMP), CAST('2021-07-12 02:00:50' AS TIMESTAMP), 1),
('101', '0', CAST('2021-07-13 02:00:28' AS TIMESTAMP), CAST('2021-07-13 02:00:50' AS TIMESTAMP), 1),
('102', '0', CAST('2021-10-01 02:00:28' AS TIMESTAMP), CAST('2021-10-01 02:00:50' AS TIMESTAMP), 1),
('102', '0', CAST('2021-10-02 02:00:01' AS TIMESTAMP), CAST('2021-10-02 02:01:50' AS TIMESTAMP), 1),
('102', '0', CAST('2021-10-03 03:00:55' AS TIMESTAMP), CAST('2021-10-03 03:00:59' AS TIMESTAMP), 1),
('102', '0', CAST('2021-10-04 03:00:45' AS TIMESTAMP), CAST('2021-10-04 03:00:55' AS TIMESTAMP), 0),
('102', '0', CAST('2021-10-05 03:00:53' AS TIMESTAMP), CAST('2021-10-05 03:00:59' AS TIMESTAMP), 1),
('102', '0', CAST('2021-10-06 03:00:45' AS TIMESTAMP), CAST('2021-10-06 03:00:55' AS TIMESTAMP), 1);
具体实现
第一步
插入的数据是时间戳类型,因为我们统计登录信息就可以了,把时间戳做个格式转换“yyyy-MM-dd”,并且去除不需要的字段
select user_id, date_format(start_time, "yyyy-MM-dd") ts
from user_sign
where sign = 1;
结果如下:
这样看起来就省心多了,接下来开始第二步
第二步
统计用户的连续登录区间,经典老套路了。
先给时间开个窗,按照用户的ID分组
然后现在的时间和排名作减法【date_sub】,如果结果是相等的,就说明是连续登录的。上代码
select user_id,
ts,
row_number() over (partition by user_id order by ts) rn,
date_sub(ts, row_number() over (partition by user_id order by ts)) flag
from (select user_id, date_format(start_time, "yyyy-MM-dd") ts
from user_sign
where sign = 1) t;
ts表示登录的日期
rn是用了开窗排序,按照登录日期来排序
如果flag字段相等的话,就说明在这段时间是连续登录的
表中的102用户,他的登录区间是断开的,我们要怎么把1,2,3,4,5拆成1,2,3和1,2呢
只需要再次用row_number开个窗,不过这次是按照用户的id和flag进行分组,把每个用户的不同登录区间找出来
select user_id,
ts,
row_number() over (partition by t1.user_id,t1.flag order by t1.ts) cnt
from (select user_id,
ts,
row_number() over (partition by user_id order by ts) rn,
date_sub(ts, row_number() over (partition by user_id order by ts)) flag
from (select user_id, date_format(start_time, "yyyy-MM-dd") ts
from user_sign
where sign = 1) t) t1;
现在思路是不是突然“柳暗花明又一村”了?
根据我们最初对登录领金币的需求分析,对第三天登录和第七天登录的金币作标记3和7,其余的天数登录都是1。
select user_id,
case
when (cnt - 3) = 3 or (cnt - 3) % 7 = 0 then 3
when (cnt % 7) = 0 then 7
else 1 end glods
from (select user_id,
ts,
row_number() over (partition by t1.user_id,t1.flag order by t1.ts) cnt
from (select user_id,
ts,
row_number() over (partition by user_id order by ts) rn,
date_sub(ts, row_number() over (partition by user_id order by ts)) flag
from (select user_id, date_format(start_time, "yyyy-MM-dd") ts
from user_sign
where sign = 1) t) t1) t2;
然后现在进行简单的分id做sum聚合就可以了
select user_id, sum(glods) num
from (select user_id,
case
when (cnt - 3) = 3 or (cnt - 3) % 7 = 0 then 3
when (cnt % 7) = 0 then 7
else 1 end glods
from (select user_id,
ts,
row_number() over (partition by t1.user_id,t1.flag order by t1.ts) cnt
from (select user_id,
ts,
row_number() over (partition by user_id order by ts) rn,
date_sub(ts, row_number() over (partition by user_id order by ts)) flag
from (select user_id, date_format(start_time, "yyyy-MM-dd") ts
from user_sign
where sign = 1) t) t1) t2) t3
group by user_id;
至此,完成对id 的金币统计