题目:
业务背景
现有收集到用户的页面点击行为日志数据,数据格式如下:
用户id, 点击时间
user_id click_time
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
B,2020-05-15 02:03:00
B,2020-05-15 02:29:40
B,2020-05-15 04:00:00
业务:
会话概念:用户的一次会话含义是指用户进入系统开始到用户离开算作一次会话,离开或者重新开始一次会话的概念是指用户的两次行为事件差值大于30分钟,
比如以A用户为例:
第一次会话:
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
第二次会话:
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
判断条件是只要两次时间差值大于30分钟就属于两次会话。
需求:
对用户的日志数据打上会话内序号,如下:
A,2020-05-15 01:30:00,1
A,2020-05-15 01:35:00,2
A,2020-05-15 02:00:00,3
A,2020-05-15 03:00:10,1
A,2020-05-15 03:05:00,2
B,2020-05-15 02:03:00,1
B,2020-05-15 02:29:40,2
B,2020-05-15 04:00:00,1
实现:
在Hive中完成数据加载
–创建表droptable if exists user_clicklog;
create table user_clicklog (
user_id string,
click_time string )
row format delimited fields terminated by",";
–加载数据
load data local inpath ‘/root/impala_data/clicklog.dat’ into table user_clicklog;
使用Impala sql完成指标统计…
思路:
1.使用lag函数,按user_id分区,计算2次点击时间的间隔;
2.使用case when,计算flag字段,间隔大于30分钟的置为1,其它置为0;
3.使用sum函数按user_id分区计算flag的和;
4.按照user_id和sumflag分区即可。
sql:
with tmp as (
SELECT
user_id,
click_time,
sum(flag) over (
PARTITION BY user_id
ORDER BY
click_time
) sumflag
FROM
(
SELECT
user_id,
click_time,
CASE
WHEN sub_min >= 30 THEN 1
ELSE 0
END flag
FROM
(
SELECT
user_id,
click_time,
(
unix_timestamp(click_time) - unix_timestamp(
lag (click_time) over (
PARTITION BY user_id
ORDER BY
click_time
)
)
) / 60 sub_min
FROM
user_clicklog
) t1
) t2
)
select
user_id,
click_time,
row_number() over(
partition by user_id,
sumflag
order by
click_time
) index
from
tmp
最终结果:
+---------+---------------------+-------+
| user_id | click_time | index |
+---------+---------------------+-------+
| A | 2020-05-15 01:30:00 | 1 |
| A | 2020-05-15 01:35:00 | 2 |
| A | 2020-05-15 02:00:00 | 3 |
| A | 2020-05-15 03:00:10 | 1 |
| A | 2020-05-15 03:05:00 | 2 |
| B | 2020-05-15 02:03:00 | 1 |
| B | 2020-05-15 02:29:40 | 2 |
| B | 2020-05-15 04:00:00 | 1 |
+---------+---------------------+-------+