问题:con_table(user_id,ttime) ttime为用户登陆时间,现在需要找出来连续登陆时间天数超过3天的用户
create table con_table (
user_id int not null,
ttime datetime not null);
insert into con_table values (1,'2019-07-07 10:00:01');
insert into con_table values (1,'2019-07-07 11:00:01');
insert into con_table values (1,'2019-07-07 12:00:01');
insert into con_table values (1,'2019-07-08 10:00:01');
insert into con_table values (1,'2019-07-08 11:00:01');
insert into con_table values (1,'2019-07-09 10:00:01');
insert into con_table values (1,'2019-07-11 10:00:01');
insert into con_table values (1,'2019-07-12 10:00:01');
insert into con_table values (1,'2019-07-20 10:00:01');
insert into con_table values (1,'2019-07-21 10:00:01');
insert into con_table values (1,'2019-07-21 11:00:01');
insert into con_table values (1,'2019-07-22 10:00:01');
insert into con_table values (1,'2019-07-23 10:00:01');
insert into con_table values (2,'2019-07-07 10:00:01');
insert into con_table values (2,'2019-07-07 11:00:01');
insert into con_table values (2,'2019-07-08 12:00:01');
insert into con_table values (2,'2019-07-09 10:00:01');
insert into con_table values (2,'2019-07-10 11:00:01');
insert into con_table values (2,'2019-07-12 10:00:01');
insert into con_table values (2,'2019-07-14 10:00:01');
insert into con_table values (3,'2019-07-10 10:00:01');
insert into con_table values (3,'2019-07-11 11:00:01');
insert into con_table values (3,'2019-07-11 12:00:01');
insert into con_table values (3,'2019-07-11 13:00:01');
insert into con_table values (3,'2019-07-11 14:00:01');
insert into con_table values (3,'2019-07-20 10:00:01');
第一步:我们的时间是精确到秒的,也就是我们用户可能一天登陆多次,所以第一步要对userid和ttime去重复
select
distinct
user_id,
date_format(ttime,'%y-%m-%d') as days
from con_table
hive中可以用to_date?或者yy-mm-dd?【待定确认】
第二步:基于上面的表,对每个用户,每天排序
select
user_id,
days,
(select count(days)
from
(
select distinct user_id,date_format(ttime,'%y-%m-%d') as days
from con_table
) t2
where t2.user_id = t1.user_id and t2.days > t1.days
) + 1 as rnk
from
(
select
distinct
user_id,
date_format(ttime,'%y-%m-%d') as days
from con_table
) as t1;
第三步:添加上我们所有需要的信息
第四步:对userid,index进行groupby
最后:再having count(*) >= 3就好啦
参考链接https://zhuanlan.zhihu.com/p/49285570
我这个例子比参考连接复杂了一点,因为把时间具体化了,
反正核心思路,就是对date倒序排列rnk【连续的】,然后max(date)-rnk【有意义的连续的】,date-max(date)-rnk【如果是同一个值就是连续的,不是同一个值就不是连续的】
https://www.qingtingip.com/h_240463.html
稍微简单的做法:直接date正序row_number,然后date和它相剪,如果是连续的,会得到同一个值
窗口函数做法:
select
distinct u_id
from
(
select
user_id
,date
,lag(date,6) over(partition by u_id order by date) as lag2_date
from t
) as tt
where datediff(tt,lag2_date) = 6
连续7天下过单,是这样,如果这个用户真的连续7天,那么最后一天往前回溯6天,保证date差是6就可以了
涉及到日期精确到s时,看下面的去重做法
总结:当明确指出连续3/7天登陆的时候,用lag即可,当不指出让你算每个用户连续登陆天数的时候,就用dt-row_number然后group by 就好,【max(dt)-min(dt)】就是我们要的时间差