数据:
1001 2020-09-10 10:21:21 home.html
1001 2020-09-10 10:28:10 good_list.html
1001 2020-09-10 10:35:05 good_detail.html
1001 2020-09-10 10:42:55 cart.html
1001 2020-09-10 11:35:21 home.html
1001 2020-09-10 11:36:10 cart.html
1001 2020-09-10 11:38:12 trade.html
1001 2020-09-10 11:38:55 payment.html
1002 2020-09-10 09:40:00 home.html
1002 2020-09-10 09:41:00 mine.html
1002 2020-09-10 09:42:00 favor.html
1003 2020-09-10 13:10:00 home.html
1003 2020-09-10 13:15:00 search.html
需求:
求得用户每次会话的行为轨迹(同一用户,上一条与下一条在半小时内为一次会话)
要求结果如下:
A 1001 2020-09-10 10:21:21 home.html 1
A 1001 2020-09-10 10:28:10 good_list.html 2
A 1001 2020-09-10 10:35:05 good_detail.html 3
A 1001 2020-09-10 10:42:55 cart.html 4
B 1001 2020-09-10 11:35:21 home.html 1
B 1001 2020-09-10 11:36:10 cart.html 2
B 1001 2020-09-10 11:38:12 trade.html 3
B 1001 2020-09-10 11:38:55 payment.html 4
C 1002 2020-09-10 09:40:00 home.html 1
C 1002 2020-09-10 09:41:00 mine.html 2
C 1002 2020-09-10 09:42:00 favor.html 3
D 1003 2020-09-10 13:10:00 home.html 1
D 1003 2020-09-10 13:15:00 search.html 2
一、sql语句:
1、建表:
create table page_session(
user_id string,
page_time string,
page string)
row format delimited
fields terminated by '\t';
2、按时间排序-升序,按user_id分组,使用开窗函数lag()获得上一条的时间
select
user_id,
page_time,
page,
lag(page_time) over(partition by user_id order by page_time asc) before_time
from page_session;
结果如下:
user_id page_time page before_time
1001 2020-09-10 10:21:21 home.html NULL
1001 2020-09-10 10:28:10 good_list.html 2020-09-10 10:21:21
1001 2020-09-10 10:35:05 good_detail.html 2020-09-10 10:28:10
1001 2020-09-10 10:42:55 cart.html 2020-09-10 10:35:05
1001 2020-09-10 11:35:21 home.html 2020-09-10 10:42:55
1001 2020-09-10 11:36:10 cart.html 2020-09-10 11:35:21
1001 2020-09-10 11:38:12 trade.html 2020-09-10 11:36:10
1001 2020-09-10 11:38:55 payment.html 2020-09-10 11:38:12
1002 2020-09-10 09:40:00 home.html NULL
1002 2020-09-10 09:41:00 mine.html 2020-09-10 09:40:00
1002 2020-09-10 09:42:00 favor.html 2020-09-10 09:41:00
1003 2020-09-10 13:10:00 home.html NULL
1003 2020-09-10 13:15:00 search.html 2020-09-10 13:10:00
3、给每个会话添加会话id
两个条件判断是否为新会话:
1、before_time是否为null
2、判断上一次时间与