HQL面试题55:阿里最新面试题之地铁站用户行为轨迹分析

49 篇文章 222 订阅
48 篇文章 119 订阅

目录

0 需求

1 数据准备

2 数据分析

3 小结


0 需求

t1:地铁站点表

user_id		in_time	    				out_time          station_id
001    		2022-05-20 15:31:21  		null      			001
001    		null    					2022-05-20 16:01:05 002
001    		2022-05-20 18:02:21  		null      			003
001      	null     					2022-05-20 18:22:17 001
001    		2022-05-20 20:39:27  		null      			004
001      	null     					2022-05-20 21:55:33 005

t2:商场表

t2
user_id market_id  scan_time
001    1001      2022-05-20 16:11:41
001    1001      2022-05-20 16:11:51
001    1001      2022-05-20 16:11:58
001    1002      2022-05-20 17:01:28
001    1003      2022-05-20 18:31:28
001    1003      2022-05-20 18:31:58

问题:统计用户每天的行动轨迹。注意:商场表中存在重复扫码问题,取最新的数据

1 数据准备

创建t1表

create table t1 as 
select '001' as user_id ,'2022-05-20 15:31:21' as in_time ,null as out_time ,'001' as station_id
union all      
select '001' as user_id ,null as in_time ,'2022-05-20 16:01:05' as out_time ,'002' as station_id
union all      
select '001' as user_id ,'2022-05-20 18:02:21' as in_time ,null as out_time ,'003' as station_id
union all      
select '001' as user_id ,null as in_time, '2022-05-20 18:22:17' as out_time ,'001' as station_id
union all      
select '001' as user_id ,'2022-05-20 20:39:27' as in_time ,null as out_time ,'004' as station_id
union all      
select '001' as user_id ,null as in_time ,'2022-05-20 21:55:33' as out_time ,'005' as station_id


创建t2表

create table t2 as 
select '001' as user_id ,'1001' as market_id ,'2022-05-20 16:11:41' as scan_time 
union all                                   
select '001' as user_id ,'1001' as market_id ,'2022-05-20 16:11:51' as scan_time
union all                                   
select '001' as user_id ,'1001' as market_id ,'2022-05-20 16:11:58' as scan_time 
union all                                   
select '001' as user_id ,'1002' as market_id ,'2022-05-20 17:01:28' as scan_time 
union all                                   
select '001' as user_id ,'1003' as market_id ,'2022-05-20 18:31:28' as scan_time
union all                                   
select '001' as user_id ,'1003' as market_id ,'2022-05-20 18:31:58' as scan_time 

2 数据分析

用户行为轨迹分析,一般是用户随时间轴的变化而变化,对于此类对时间比较敏感的问题,我们往往采用将所有时间放一起进行排序,然后再分析其轨迹,本题的突破点也在于此,根据以上分析,我们先将表中的所有时间字段拿出来组成一列记为tarce_time,将用户所到的站点及商场id记为trace_id,那么trace_id是随轨迹时间变化的,注意到由于用户进入一个商场会有重复扫码的问题,我们对时间进行去重取出时间最新的一条记录。很明显,我们需要先进行行转列,具体SQL如下

   select
      user_id,
      trace_time,
      trace_id
    from
      (
        select
          user_id,
          in_time as trace_time,
          station_id as trace_id
        from
          t1
        union all
        select
          user_id,
          out_time as trace_time,
          station_id as trace_id
        from
          t1
        union all
        select
          user_id,
          max(scan_time) as trace_time, --去重取出最新一条记录的时间
          market_id as trace_id
        from
          t2
        group by user_id, market_id
      ) t
    order by
      trace_time

得到的结果如下:

user_id trace_time      trace_id
001     NULL    005
001     NULL    002
001     NULL    001
001     NULL    003
001     NULL    004
001     NULL    001
001     2022-05-20 15:31:21     001
001     2022-05-20 16:01:05     002
001     2022-05-20 16:11:58     1001
001     2022-05-20 17:01:28     1002
001     2022-05-20 18:02:21     003
001     2022-05-20 18:22:17     001
001     2022-05-20 18:31:58     1003
001     2022-05-20 20:39:27     004
001     2022-05-20 21:55:33     005

步骤2:对上述结果按照用户及天的维度进行汇总分析,组成用户的轨迹,此时我们利用concat_ws(',',collect_list())对用户行为轨迹进行组合。具体最终SQL如下:

select
  user_id,
  to_date(trace_time) as dt,
  concat_ws(',', collect_list(trace_id)) as trace
from
  (
    select
      user_id,
      trace_time,
      trace_id
    from
      (
        select
          user_id,
          in_time as trace_time,
          station_id as trace_id
        from
          t1
        union all
        select
          user_id,
          out_time as trace_time,
          station_id as trace_id
        from
          t1
        union all
        select
          user_id,
          max(scan_time) as trace_time,
          market_id as trace_id
        from
          t2
        group by user_id, market_id
      ) t
    order by
      trace_time
) t
where trace_time is not null
group by user_id,to_date(trace_time)

最终结果如下:

user_id dt              trace
001     2022-05-20      001,002,1001,1002,003,001,1003,004,005
Time taken: 5.276 seconds, Fetched: 1 row(s)

3 小结

本题属于对用户轨迹分析的问题,此类问题的突破点在于把所有的时间组成一列进行排序。具体考察SQL知识点如下:

  • (1)列转行(union all方式)
  • (2)collect_list()函数求用户轨迹
  • (3)去重取最新一条时间数据的技巧
  • 2
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 2
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值