提示:文章写完后,目录可以自动生成,如何生成可参考右边的帮助文档
文章目录
目录
1.找出用户的行动轨迹(在每个区域停留的开始时间和结束时间)
前言
这是一道很新颖的面试题,不像留存和连续登陆那么经典
可以试着学习一下,可以锻炼sql查询逻辑
参考b站董旭阳TonyDong董老师的SQL面试题:基于扫码记录查找密接人员
提示:以下是本篇文章正文内容,下面案例可供参考
一、数据源
用户扫码记录存储在trail表里
二、解题步骤
1.找出用户的行动轨迹(在每个区域停留的开始时间和结束时间)
1,运用窗口函数查出diff字段(因为同一个diiff代表在一个区域)
代码:
select uid,area,scan_time,
row_number() over(partition by uid order by scan_time) num1,
row_number() over(partition by uid,area order by scan_time) num2,
row_number() over(partition by uid order by scan_time)-row_number() over(partition by uid,area order by scan_time) diff
from trail
得到的表:
num1:用户的扫描时间的排序可以代表用户的轨迹,但一个用户可能会在多个区域之间跨越。所以在同一个区域的开始时间和结束时间会有多个。
num2:将每一个用户在每一个区域的扫码时间进行排序。
diff:相邻diff的值相同表示该用户在一个区域内,diff值突然变化,表示用户有所移动。
将以上的查询定义为一个临时表
with t1 as
(select uid,area,scan_time,
row_number() over(partition by uid order by scan_time) num1,
row_number() over(partition by uid,area order by scan_time) num2,
row_number() over(partition by uid order by scan_time)-row_number() over(partition by uid,area order by scan_time) diff
from trail)
2,用聚合函数找出在一个区域内的开始时间和结束时间
注意:这里的一个区域不是指同一个地方,是指在某个地方停留,即使是去过两次的地方。
代码:
with t1 as
(select uid,area,scan_time,
row_number() over(partition by uid order by scan_time) num1,
row_number() over(partition by uid,area order by scan_time) num2,
row_number() over(partition by uid order by scan_time)-row_number() over(partition by uid,area order by scan_time) diff
from trail)
select uid,area,min(scan_time),max(scan_time)
from t1
group by uid,area,diff
得到的表:
知识点:
1,为什么group三个字段?
不管怎样都是一个一个uid对应的一个area对应一个扫描时间,只不过聚合依据不同,分到的区也不同,每个区的最大最小值也不同。
如果只聚合依据为uid,area这两个字段,没法判断什么时候跳到另一个area。聚合依据为uid,area,diff这三个字段的时候,一个uid对应的一个area因为diff的不同进行了分区,就能在这个分区里找最大最小值。
就如表里1301111111的id对应两条A001 的数据,表明去过两次A001。如果没有diff的聚合,1301111111的id就会被误判为之去过一次A001。
3,order by排序
可以得到整一个通过时间推移的用户轨迹(scan_time的字段是连续的,area是跳跃的)
代码:
with t1 as
(select uid,area,scan_time,
row_number() over(partition by uid order by scan_time) num1,
row_number() over(partition by uid,area order by scan_time) num2,
row_number() over(partition by uid order by scan_time)-row_number() over(partition by uid,area order by scan_time) diff
from trail)
select uid,area,min(scan_time),max(scan_time)
from t1
group by uid,area,diff
order by uid,scan_time
得到的表:
2.找出时空伴随人员
1,伴随规则
1,在阳性人员停留半小时以上的区域,用户停留了半小时以上。
2,并且停留时间和阳性人员有十分钟以上的交集。
如上图所示,用户2可以被确定为用户1的时空伴随人员(密接者)。
2,自链接查找密接
1,将上面写好的用户行为轨迹的表格定义为临时表t2,并且过滤掉在区域内停留时间少于30min的部分轨迹。
代码:
with t1 as
(select uid,area,scan_time,
row_number() over(partition by uid order by scan_time) num1,
row_number() over(partition by uid,area order by scan_time) num2,
row_number() over(partition by uid order by scan_time)-row_number() over(partition by uid,area order by scan_time) diff
from trail),
t2 as
(select uid,area,min(scan_time) start_time,max(scan_time) end_time
from t1
group by uid,area,diff
having min(scan_time)+interval 30min <= max(scan_time)
order by uid,scan_time)
得到的表:
知识点:
interval后面可以跟分为单位的,并且可以直接时间字段+interval这样计算
2,自链接寻找和阳性人员有十分钟以上的交集
假设13011111111是确诊人员,将t2设置为u1表只有13011111111
with t1 as
(select uid,area,scan_time,
row_number() over(partition by uid order by scan_time) num1,
row_number() over(partition by uid,area order by scan_time) num2,
row_number() over(partition by uid order by scan_time)-row_number() over(partition by uid,area order by scan_time) diff
from trail),
t2 as
(select uid,area,min(scan_time) start_time,max(scan_time) end_time
from t1
group by uid,area,diff
having min(scan_time)+interval 30min <= max(scan_time)
order by uid,scan_time),
select *
from t2 u1
join t2 u2
on(.......#链接键下一步讲)
where u1.uid="13011111111"
链接键该怎么写呢?
连接键1:首先确证人员13011111111不用跟自己连,连接键2:其次得在一个区域的人相连,连接键3:然后确证人员13011111111先到达某个区域的情况,连接键4:确证人员13011111111后到达某个区域的情况。
知识点:
1,u1虽然是表链接后过滤的,也可以当作先过滤掉再来看连接键。
2,关于链接键3,4:因为之前在t2里面已经过滤掉了没待够30min的,u1,u2都是待够30min的记录相连,只要保证在密接到达该区域十分钟前没离开就ok。这两个连接键可以看作是取交集的一个方式。
代码:
with t1 as
(select uid,area,scan_time,
row_number() over(partition by uid order by scan_time) num1,
row_number() over(partition by uid,area order by scan_time) num2,
row_number() over(partition by uid order by scan_time)-row_number() over(partition by uid,area order by scan_time) diff
from trail),
t2 as
(select uid,area,min(scan_time) start_time,max(scan_time) end_time
from t1
group by uid,area,diff
having min(scan_time)+interval 30minute <= max(scan_time)
order by uid,scan_time),
select *
from t2 u1
join t2 u2
on(u1.uid<>u2.uid and u1.area=u2.area
and u1.start_time+interval 10 minute <= u2.end_time
and u2.start_time+interval 10 minute <= u1.end_time)
where u1.uid="13011111111"
得到的表:
总结
这道题目主要考查的是窗口函数排序然后序号相减找规律得到我们想要的一个表的结构,其次是使用自链接,根据实际情况添加连接键来控制交集。