HQL:用户连续登陆问题
问题:求一个用户连续登陆多少天,并且间隔一天也算连续。
数据:
id login
id001 2021-01-01
id001 2021-01-02
id001 2021-01-03
id001 2021-01-05
id001 2021-01-07
id001 2021-01-08
id001 2021-01-15
id001 2021-01-16
举例说明:id为id001的用户在1号至8号连续登陆8天。
建表语句:
create table tx (
id string,
login string
)
row format delimited fields terminated by '\t';
思路1 等差数列思想
1)给每条数据标记rank值:(这里的日期没有重复,因此使用Hive中的三种排序方式均可)
id001 2021-01-01 1
id001 2021-01-02 2
id001 2021-01-03 3
id001 2021-01-05 4
id001 2021-01-07 5
id001 2021-01-08 6
id001 2021-01-15 7
id001 2021-01-16 8
子查询:
select
id,
login,
rank() over(partition by id order by login) rk
from
tx; t1
2)使用login字段 - rank值:
id001 2021-01-00(举例,实际上应该为2020-12-31)
id001 2021-01-00
id001 2021-01-00
id001 2021-01-01
id001 2021-01-02
id001 2021-01-02
id001 2021-01-08
id001 2021-01-08
login字段 - rank值的结果相等代表连续
子查询:
select
id,
date_sub(login,rk) flag
from t1; t2
3)根据上一步的结果分组,求count,count的结果代表严格连续的天数:
id001 2021-01-00 3
id001 2021-01-01 1
id001 2021-01-02 2
id001 2021-01-08 2
子查询:
select
id,
flag,
count(*) ct
from t2
group by id,flag; t3
4)对结果集再进行rank:
id001 2021-01-00 3 1
id001 2021-01-01 1 2
id001 2021-01-02 2 3
id001 2021-01-08 2 4
子查询:
select
id,
flag,
ct,
rank() over (partition by id order by flag) rk
from t3; t4
5)使用时间字段 - rank:
id001 2021-01- -1 3
id001 2021-01- -1 1
id001 2021-01- -1 2
id001 2021-01- 04 2
子查询:
select
id,
date_sub(flag,rk) flag,
ct
from t4; t5
6)分组求连续天数,得到最终结果:
select
id,
flag,
count(*) ct
from t5
group by id,flag;
最终SQL:
select id,
flag,
sum(ct) + count(*) - 1 ct
from (
select id,
date_sub(flag, rk) flag,
ct
from (select id,
flag,
ct,
rank() over (partition by id order by flag) rk
from (
select id,
flag,
count(*) ct
from (
select id,
date_sub(login, rk) flag
from (
select id,
login,
rank() over (partition by id order by login) rk
from tx
) t1
) t2
group by id, flag
) t3) t4
) t5
group by id, flag;
思路2 直接分组思路
1)将上一行下移:
id001 2021-01-01 1970-01-01
id001 2021-01-02 2021-01-01
id001 2021-01-03 2021-01-02
id001 2021-01-05 2021-01-03
id001 2021-01-07 2021-01-05
id001 2021-01-08 2021-01-07
id001 2021-01-15 2021-01-08
id001 2021-01-16 2021-01-15
子查询:
select
id,
login,
lag(login,1,'1970-01-01') over (partition by id order by login) laglogin
from
tx; t1
2)用当前日期减去上一行日期:
select
id,
login,
datediff(login,laglogin) flag
from t1; t2
3)给每一行做标记,上一步的结果大于2的标记为1,否则标记为0,然后sum求和
注意:这里使用spark引擎会报错,可以将引擎切换成MR,或者建表时login的字段类型改成date。
set hive.execution.engine=mr;
select
id,
login,
sum(if(flag > 2,1,0)) over (partition by order by login) groupsum
from t2; t3
4)按照id,groupsum分组,并且使用最大日期减去最小日期+1,就是最终结果
select
id,
groupsum,
datediff(max(login),min(login)) + 1
from t3
group by
id,groupsum;
最终SQL:
select id,
groupsum,
datediff(max(login), min(login)) + 1
from (
select id,
login,
sum(if(flag > 2, 1, 0)) over (partition by id order by login) groupsum
from (
select id,
login,
datediff(login, laglogin) flag
from (
select id,
login,
lag(login, 1, '1970-01-01') over (partition by id order by login) laglogin
from tx
) t1
) t2
) t3
group by id, groupsum;