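SQL interviews often ask the question in the title: how do you find users who logged in on N consecutive days? This post walks through the approaches I came up with; if you know others, feel free to share them.
I. Prepare the data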
-- Create the table
create table tmp_20201128
(
    id       bigint,
    log_time string
);
-- Insert the data
insert into tmp_20201128
select 1 as id,'2020-01-01 09:00:00' as log_time
union all
select 1 as id,'2020-01-01 10:00:00' as log_time
union all
select 1 as id,'2020-01-02 10:00:00' as log_time
union all
select 1 as id,'2020-01-02 11:00:00' as log_time
union all
select 1 as id,'2020-01-02 12:00:00' as log_time
union all
select 1 as id,'2020-01-03 12:00:00' as log_time
union all
select 1 as id,'2020-01-04 12:00:00' as log_time
union all
select 1 as id,'2020-01-05 12:00:00' as log_time
union all
select 2 as id,'2020-01-01 09:00:00' as log_time
union all
select 2 as id,'2020-01-01 10:00:00' as log_time
union all
select 2 as id,'2020-01-02 10:00:00' as log_time
union all
select 3 as id,'2020-01-01 10:00:00' as log_time
union all
select 3 as id,'2020-01-03 12:00:00' as log_time
union all
select 3 as id,'2020-01-04 12:00:00' as log_time
union all
select 3 as id,'2020-01-05 12:00:00' as log_time
;
II. Analyze the data
Query the raw data, as shown in the figure below.

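Because the test data set is small, N is fixed at 3 here. Looking at the distinct login dates, user 1 logged in on every day from 2020-01-01 through 2020-01-05 and user 3 on 2020-01-03 through 2020-01-05, while user 2 only has 2020-01-01 and 2020-01-02. The users with id 1 and 3 are therefore the expected result.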
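As a quick cross-check of this analysis, here is a minimal Python sketch that is independent of the SQL used in the rest of the post. It reduces the sample rows to one date per user per day and measures each user's longest run of consecutive days. The names `ROWS` and `longest_streaks` are my own, chosen only for illustration.

```python
from datetime import date, timedelta

# Same sample rows as the tmp_20201128 table: (id, log_time).
ROWS = [
    (1, "2020-01-01 09:00:00"), (1, "2020-01-01 10:00:00"),
    (1, "2020-01-02 10:00:00"), (1, "2020-01-02 11:00:00"),
    (1, "2020-01-02 12:00:00"), (1, "2020-01-03 12:00:00"),
    (1, "2020-01-04 12:00:00"), (1, "2020-01-05 12:00:00"),
    (2, "2020-01-01 09:00:00"), (2, "2020-01-01 10:00:00"),
    (2, "2020-01-02 10:00:00"),
    (3, "2020-01-01 10:00:00"), (3, "2020-01-03 12:00:00"),
    (3, "2020-01-04 12:00:00"), (3, "2020-01-05 12:00:00"),
]


def longest_streaks(rows):
    """Return {id: length of the longest run of consecutive login days}."""
    # Keep one date per user per day (the set removes same-day duplicates).
    days_by_id = {}
    for uid, log_time in rows:
        days_by_id.setdefault(uid, set()).add(date.fromisoformat(log_time[:10]))

    streaks = {}
    for uid, days in days_by_id.items():
        best = 0
        for day in days:
            if day - timedelta(days=1) in days:
                continue  # not the first day of a run
            length = 1
            while day + timedelta(days=length) in days:
                length += 1
            best = max(best, length)
        streaks[uid] = best
    return streaks


streaks = longest_streaks(ROWS)
print(streaks)                                              # {1: 5, 2: 2, 3: 3}
print(sorted(u for u, n in streaks.items() if n >= 3))      # [1, 3]
```

With N = 3, only users 1 and 3 reach the threshold, which is what the SQL approaches below are expected to return.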
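III. Approaches
1. Method 1: row_number
- Steps
① Keep only one row per user per day.
② Use row_number, partitioned by id and ordered by date ascending, to get the rank column rnk.
③ Subtract rnk days from the date to get the anchor date begin_time. Within one run of consecutive days, the date and rnk both grow by 1 from row to row, so every row of the run gets the same begin_time.
④ Group by id and begin_time and count; users with a count greater than 2 (that is, at least 3) are the result.
- SQL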
select distinct
    id
from
(
    select
        id
        -- Date functions differ slightly between databases
        ,dateadd(to_date(log_date,'yyyy-mm-dd'),-rnk,'dd') as begin_time
        ,count(1) as cnt
    from
    (
        select
            id
            ,log_date
            ,row_number() over(partition by id order by log_date) as rnk
        from
        (
            -- One row per user per day
            select distinct id,substr(log_time,1,10) as log_date
            from tmp_20201128
        ) t1
    ) t2
    group by
        id
        ,dateadd(to_date(log_date,'yyyy-mm-dd'),-rnk,'dd')
    having
        count(1) > 2
) t3
;
- Execution result
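Since date functions vary by database, here is a hedged, runnable sketch of the same row_number idea against an in-memory SQLite database from Python. It is a port, not the query above: `date(log_date, '-N days')` replaces the `dateadd`/`to_date` calls, and the names `ROWS`, `QUERY` and `consecutive_ids_row_number` are illustrative. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Same sample rows as the tmp_20201128 table: (id, log_time).
ROWS = [
    (1, "2020-01-01 09:00:00"), (1, "2020-01-01 10:00:00"),
    (1, "2020-01-02 10:00:00"), (1, "2020-01-02 11:00:00"),
    (1, "2020-01-02 12:00:00"), (1, "2020-01-03 12:00:00"),
    (1, "2020-01-04 12:00:00"), (1, "2020-01-05 12:00:00"),
    (2, "2020-01-01 09:00:00"), (2, "2020-01-01 10:00:00"),
    (2, "2020-01-02 10:00:00"),
    (3, "2020-01-01 10:00:00"), (3, "2020-01-03 12:00:00"),
    (3, "2020-01-04 12:00:00"), (3, "2020-01-05 12:00:00"),
]

# SQLite port of method 1: date minus row number is constant within a run.
QUERY = """
select distinct id
from (
    select id,
           date(log_date, '-' || rnk || ' days') as begin_date,
           count(*) as cnt
    from (
        select id, log_date,
               row_number() over (partition by id order by log_date) as rnk
        from (
            -- one row per user per day
            select distinct id, substr(log_time, 1, 10) as log_date
            from tmp_20201128
        )
    )
    group by id, date(log_date, '-' || rnk || ' days')
    having count(*) >= 3
)
order by id
"""


def consecutive_ids_row_number(rows):
    """Load rows into an in-memory table and return ids with a 3-day run."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tmp_20201128 (id integer, log_time text)")
    conn.executemany("insert into tmp_20201128 values (?, ?)", rows)
    ids = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return ids


print(consecutive_ids_row_number(ROWS))  # [1, 3]
```

For user 3, the dates 2020-01-03, 2020-01-04 and 2020-01-05 have ranks 2, 3 and 4, so all three map to 2020-01-01 and form one group of size 3.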

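2. Method 2: lag or lead
lag and lead work almost identically, so only lag is shown here.
- Steps
① Keep only one row per user per day.
② Use lag, partitioned by id and ordered by date ascending, to fetch the date two rows earlier, and subtract 2 days from the current row's date to get the date on which a full 3-day run ending today would have to start.
③ Compare the two dates from step ②. If they are equal, the user logged in on 3 consecutive days: the dates are distinct and sorted, so the row in between can only be the day before the current one.
- SQL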
select
    distinct id
from
(
    select
        id
        ,log_date
        -- Current date minus 2 days: where a 3-day run ending today must start
        ,substr(dateadd(to_date(log_date,'yyyy-mm-dd'),-2,'dd'),1,10) as math_log_date
        -- Login date two rows earlier for the same user
        ,lag(log_date,2) over(partition by id order by log_date) as last_log_date
    from
    (
        select distinct id,substr(log_time,1,10) as log_date
        from tmp_20201128
    ) t1
) t2
where
    math_log_date = last_log_date
;
- Execution result
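The lag comparison can be tried out the same way. The sketch below is a SQLite port run from Python, not the query above: `date(log_date, '-2 days')` stands in for the `dateadd`/`to_date`/`substr` expression, and `ROWS`, `QUERY`, `expected_date`, `prev2_date` and `consecutive_ids_lag` are names I made up for the example. It assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

# Same sample rows as the tmp_20201128 table: (id, log_time).
ROWS = [
    (1, "2020-01-01 09:00:00"), (1, "2020-01-01 10:00:00"),
    (1, "2020-01-02 10:00:00"), (1, "2020-01-02 11:00:00"),
    (1, "2020-01-02 12:00:00"), (1, "2020-01-03 12:00:00"),
    (1, "2020-01-04 12:00:00"), (1, "2020-01-05 12:00:00"),
    (2, "2020-01-01 09:00:00"), (2, "2020-01-01 10:00:00"),
    (2, "2020-01-02 10:00:00"),
    (3, "2020-01-01 10:00:00"), (3, "2020-01-03 12:00:00"),
    (3, "2020-01-04 12:00:00"), (3, "2020-01-05 12:00:00"),
]

# SQLite port of method 2: compare "today minus 2 days" with the date
# found two rows earlier for the same user.
QUERY = """
select distinct id
from (
    select id,
           log_date,
           date(log_date, '-2 days') as expected_date,
           lag(log_date, 2) over (partition by id order by log_date) as prev2_date
    from (
        -- one row per user per day
        select distinct id, substr(log_time, 1, 10) as log_date
        from tmp_20201128
    )
)
where expected_date = prev2_date
order by id
"""


def consecutive_ids_lag(rows):
    """Load rows into an in-memory table and return ids with a 3-day run."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tmp_20201128 (id integer, log_time text)")
    conn.executemany("insert into tmp_20201128 values (?, ?)", rows)
    ids = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return ids


print(consecutive_ids_lag(ROWS))  # [1, 3]
```

Rows with fewer than two predecessors get a NULL from lag, and a comparison with NULL is never true, so user 2 (only two distinct days) drops out without any extra filter.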

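3. Method 3: self-join
- Steps
① Keep only one row per user per day and put it in a temporary result set (all later steps work on it).
② Alias the temporary result set as t1 and t2, self-join on id, and restrict t2's date to the range from t1's date minus 2 days through t1's date.
③ Group by id and date and count; users with a count greater than 2 (that is, at least 3) are the users who logged in on consecutive days.
- SQL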
with tmp_20201128_date as
(
    select distinct id,substr(log_time,1,10) as log_date
    from tmp_20201128
)
select distinct
    id
from
(
    select
        id
        ,log_date
        ,count(1) as cnt
    from
    (
        select
            t1.id
            ,t1.log_date
            ,t2.log_date as log_date_d
        from
            tmp_20201128_date t1
        inner join
            tmp_20201128_date t2
        on t1.id = t2.id
        where
            t2.log_date between substr(dateadd(to_date(t1.log_date,'yyyy-mm-dd'),-2,'dd'),1,10) and t1.log_date
    ) t3
    group by
        id
        ,log_date
    having count(1) > 2
) t4
;
- Execution result
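The self-join needs no window functions, so it ports to almost any engine. Below is a hedged SQLite sketch run from Python, not the query above: the CTE is renamed `daily`, `date(t1.log_date, '-2 days')` replaces the `dateadd`/`to_date`/`substr` expression, and `ROWS`, `QUERY` and `consecutive_ids_self_join` are illustrative names.

```python
import sqlite3

# Same sample rows as the tmp_20201128 table: (id, log_time).
ROWS = [
    (1, "2020-01-01 09:00:00"), (1, "2020-01-01 10:00:00"),
    (1, "2020-01-02 10:00:00"), (1, "2020-01-02 11:00:00"),
    (1, "2020-01-02 12:00:00"), (1, "2020-01-03 12:00:00"),
    (1, "2020-01-04 12:00:00"), (1, "2020-01-05 12:00:00"),
    (2, "2020-01-01 09:00:00"), (2, "2020-01-01 10:00:00"),
    (2, "2020-01-02 10:00:00"),
    (3, "2020-01-01 10:00:00"), (3, "2020-01-03 12:00:00"),
    (3, "2020-01-04 12:00:00"), (3, "2020-01-05 12:00:00"),
]

# SQLite port of method 3: for each (id, day), count that user's login
# days inside the 3-day window ending on that day.
QUERY = """
with daily as (
    -- one row per user per day
    select distinct id, substr(log_time, 1, 10) as log_date
    from tmp_20201128
)
select distinct id
from (
    select t1.id as id, t1.log_date as log_date, count(*) as cnt
    from daily t1
    join daily t2
      on t1.id = t2.id
     and t2.log_date between date(t1.log_date, '-2 days') and t1.log_date
    group by t1.id, t1.log_date
    having count(*) >= 3
)
order by id
"""


def consecutive_ids_self_join(rows):
    """Load rows into an in-memory table and return ids with a 3-day run."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tmp_20201128 (id integer, log_time text)")
    conn.executemany("insert into tmp_20201128 values (?, ?)", rows)
    ids = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return ids


print(consecutive_ids_self_join(ROWS))  # [1, 3]
```

The count can only reach 3 when all three days of the window are present, because the deduplicated set holds at most one row per user per day. The trade-off is cost: each user's rows are joined against each other, so this tends to do more work than the window-function versions on large tables.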

That is everything I wanted to share; comments are welcome.