数据:
user_id click_time
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
B,2020-05-15 02:03:00
B,2020-05-15 02:29:40
B,2020-05-15 04:00:00
举例:
比如以A用户为例:
第一次会话
```txt
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
```
第二次会话
```txt
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
```
需求:判断条件是只要两次时间差值大于30分钟就属于两次会话。
A,2020-05-15 01:30:00,1
A,2020-05-15 01:35:00,2
A,2020-05-15 02:00:00,3
A,2020-05-15 03:00:10,1
A,2020-05-15 03:05:00,2
B,2020-05-15 02:03:00,1
B,2020-05-15 02:29:40,2
B,2020-05-15 04:00:00,1
解决问题思路:
1.开窗 获取每条数据的和上一条数据的时间间隔
nvl((unix_timestamp(click_time)- unix_timestamp( lag(click_time) over(partition by user_id order by click_time)))/60,0) intervaltime
2.如果超过30min则打标
if(intervaltime>=30,1,0) num
3.开窗 通过num计算分组排序后的第一条数据到目前数据的num和
sum(num) over(partition by user_id order by click_time rows between unbounded preceding and current row ) as mark
此步完整sql
select user_id,click_time,intervaltime,num
,sum(num) over(partition by user_id order by click_time rows between unbounded preceding and current row ) as mark
from
(
select user_id
,click_time
,intervaltime
, if(intervaltime>=30,1,0) num
from(
select user_id
,click_time
,nvl((unix_timestamp(click_time)- unix_timestamp( lag(click_time) over(partition by user_id order by click_time)))/60,0) intervaltime
from user_clicklog )
tmp )tmp1
+---------+---------------------+-------------------+-----+------+
| user_id | click_time | intervaltime | num | mark |
+---------+---------------------+-------------------+-----+------+
| A | 2020-05-15 01:30:00 | 0 | 0 | 0 |
| A | 2020-05-15 01:35:00 | 5 | 0 | 0 |
| A | 2020-05-15 02:00:00 | 25 | 0 | 0 |
| A | 2020-05-15 03:00:10 | 60.16666666666666 | 1 | 1 |
| A | 2020-05-15 03:05:00 | 4.833333333333333 | 0 | 1 |
| B | 2020-05-15 02:03:00 | 0 | 0 | 0 |
| B | 2020-05-15 02:29:40 | 26.66666666666667 | 0 | 0 |
| B | 2020-05-15 04:00:00 | 90.33333333333333 | 1 | 1 |
+---------+---------------------+-------------------+-----+------+
4.第三次开窗 获取不同的row_number()排序
with tmp2 as (
select user_id,click_time,intervaltime,num
,sum(num) over(partition by user_id order by click_time rows between unbounded preceding and current row ) as mark
from
(
select user_id
,click_time
,intervaltime
, if(intervaltime>=30,1,0) num
from(
select user_id
,click_time
,nvl((unix_timestamp(click_time)- unix_timestamp( lag(click_time) over(partition by user_id order by click_time)))/60,0) intervaltime
from user_clicklog )
tmp )tmp1
)
select user_id,click_time ,row_number() over(partition by user_id,mark order by click_time)
from tmp2
;
+---------+---------------------+------------------------+
| user_id | click_time | row_number() OVER(...) |
+---------+---------------------+------------------------+
| A | 2020-05-15 01:30:00 | 1 |
| A | 2020-05-15 01:35:00 | 2 |
| A | 2020-05-15 02:00:00 | 3 |
| A | 2020-05-15 03:00:10 | 1 |
| A | 2020-05-15 03:05:00 | 2 |
| B | 2020-05-15 02:03:00 | 1 |
| B | 2020-05-15 02:29:40 | 2 |
| B | 2020-05-15 04:00:00 | 1 |
+---------+---------------------+------------------------+
到这里sql就完成了,利用了3次开窗来解决。