目录
0 需求
t1:地铁站点表
user_id in_time out_time station_id
001 2022-05-20 15:31:21 null 001
001 null 2022-05-20 16:01:05 002
001 2022-05-20 18:02:21 null 003
001 null 2022-05-20 18:22:17 001
001 2022-05-20 20:39:27 null 004
001 null 2022-05-20 21:55:33 005
t2:商场表
t2
user_id market_id scan_time
001 1001 2022-05-20 16:11:41
001 1001 2022-05-20 16:11:51
001 1001 2022-05-20 16:11:58
001 1002 2022-05-20 17:01:28
001 1003 2022-05-20 18:31:28
001 1003 2022-05-20 18:31:58
问题:统计用户每天的行动轨迹。注意:商场表中存在重复扫码问题,取最新的数据
1 数据准备
创建t1表
create table t1 as
select '001' as user_id ,'2022-05-20 15:31:21' as in_time ,null as out_time ,'001' as station_id
union all
select '001' as user_id ,null as in_time ,'2022-05-20 16:01:05' as out_time ,'002' as station_id
union all
select '001' as user_id ,'2022-05-20 18:02:21' as in_time ,null as out_time ,'003' as station_id
union all
select '001' as user_id ,null as in_time, '2022-05-20 18:22:17' as out_time ,'001' as station_id
union all
select '001' as user_id ,'2022-05-20 20:39:27' as in_time ,null as out_time ,'004' as station_id
union all
select '001' as user_id ,null as in_time ,'2022-05-20 21:55:33' as out_time ,'005' as station_id
创建t2表
create table t2 as
select '001' as user_id ,'1001' as market_id ,'2022-05-20 16:11:41' as scan_time
union all
select '001' as user_id ,'1001' as market_id ,'2022-05-20 16:11:51' as scan_time
union all
select '001' as user_id ,'1001' as market_id ,'2022-05-20 16:11:58' as scan_time
union all
select '001' as user_id ,'1002' as market_id ,'2022-05-20 17:01:28' as scan_time
union all
select '001' as user_id ,'1003' as market_id ,'2022-05-20 18:31:28' as scan_time
union all
select '001' as user_id ,'1003' as market_id ,'2022-05-20 18:31:58' as scan_time
2 数据分析
用户行为轨迹分析,一般是用户随时间轴的变化而变化,对于此类对时间比较敏感的问题,我们往往采用将所有时间放一起进行排序,然后再分析其轨迹,本题的突破点也在于此,根据以上分析,我们先将表中的所有时间字段拿出来组成一列记为tarce_time,将用户所到的站点及商场id记为trace_id,那么trace_id是随轨迹时间变化的,注意到由于用户进入一个商场会有重复扫码的问题,我们对时间进行去重取出时间最新的一条记录。很明显,我们需要先进行行转列,具体SQL如下
select
user_id,
trace_time,
trace_id
from
(
select
user_id,
in_time as trace_time,
station_id as trace_id
from
t1
union all
select
user_id,
out_time as trace_time,
station_id as trace_id
from
t1
union all
select
user_id,
max(scan_time) as trace_time, --去重取出最新一条记录的时间
market_id as trace_id
from
t2
group by user_id, market_id
) t
order by
trace_time
得到的结果如下:
user_id trace_time trace_id
001 NULL 005
001 NULL 002
001 NULL 001
001 NULL 003
001 NULL 004
001 NULL 001
001 2022-05-20 15:31:21 001
001 2022-05-20 16:01:05 002
001 2022-05-20 16:11:58 1001
001 2022-05-20 17:01:28 1002
001 2022-05-20 18:02:21 003
001 2022-05-20 18:22:17 001
001 2022-05-20 18:31:58 1003
001 2022-05-20 20:39:27 004
001 2022-05-20 21:55:33 005
步骤2:对上述结果按照用户及天的维度进行汇总分析,组成用户的轨迹,此时我们利用concat_ws(',',collect_list())对用户行为轨迹进行组合。具体最终SQL如下:
select
user_id,
to_date(trace_time) as dt,
concat_ws(',', collect_list(trace_id)) as trace
from
(
select
user_id,
trace_time,
trace_id
from
(
select
user_id,
in_time as trace_time,
station_id as trace_id
from
t1
union all
select
user_id,
out_time as trace_time,
station_id as trace_id
from
t1
union all
select
user_id,
max(scan_time) as trace_time,
market_id as trace_id
from
t2
group by user_id, market_id
) t
order by
trace_time
) t
where trace_time is not null
group by user_id,to_date(trace_time)
最终结果如下:
user_id dt trace
001 2022-05-20 001,002,1001,1002,003,001,1003,004,005
Time taken: 5.276 seconds, Fetched: 1 row(s)
3 小结
本题属于对用户轨迹分析的问题,此类问题的突破点在于把所有的时间组成一列进行排序。具体考察SQL知识点如下:
- (1)列转行(union all方式)
- (2)collect_list()函数求用户轨迹
- (3)去重取最新一条时间数据的技巧