问题:
用户登入、退出app的日志表,假设里面只有一天的数据,计算以下数据:
1、24个小时,每个小时的最高同时在线人数
2、当日的最高同时在线人数
3、当日的最高同时在线时间段
uid | event_type(1-登入,2-退出) | event_time(时间戳) |
a | 1 | 2022/9/1 10:00 |
a | 2 | 2022/9/1 10:04 |
解决思路:
1、模拟测试数据:
WITH test_tab as (
select 'A' uid,
from_unixtime(
unix_timestamp('2020-08-01 10:00:00', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'B' uid,
from_unixtime(
unix_timestamp('2020-08-01 10:01:00', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'C' uid,
from_unixtime(
unix_timestamp('2020-08-01 10:02:03', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'D' uid,
from_unixtime(
unix_timestamp('2020-08-01 10:04:03', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'A' uid,
from_unixtime(
unix_timestamp('2020-08-01 10:03:10', "yyyy-MM-dd HH:mm:ss")
) event_time,
'2' event_type
union all
select 'C' uid,
from_unixtime(
unix_timestamp('2020-08-01 10:03:11', "yyyy-MM-dd HH:mm:ss")
) event_time,
'2' event_type
union all
select 'A' uid,
from_unixtime(
unix_timestamp('2020-08-01 10:04:00', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'F' uid,
from_unixtime(
unix_timestamp('2020-08-01 12:04:06', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'G' uid,
from_unixtime(
unix_timestamp('2020-08-01 12:04:10', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'H' uid,
from_unixtime(
unix_timestamp('2020-08-01 13:04:12', "yyyy-MM-dd HH:mm:ss")
) event_time,
'1' event_type
union all
select 'I' uid,
from_unixtime(
unix_timestamp('2020-08-01 15:05:10', "yyyy-MM-dd HH:mm:ss")
) event_time,
'2' event_type
)
2、把原始数据中的uid 和登录时间(event_time)取出来,然后给一个 flag,每个用户的登录时都给一个1,然后union all 退出时间,给结束时间(event_time)时一个-1,这样我们就能在这个子查询的外层以sum()和开窗函数(over)配合。
select uid,
event_time,
sum(flag) over (
order by event_time
) as online_user_cnt
from (
select uid,
event_time,
case
when event_type = '1' then 1
when event_type = '2' then -1
else 0
end flag
from test_tab
) a
结果:
产生问题:考虑到模拟的数据集中存在一些小时内没有任何用户进出的情况,而导致没有每个小时的累加数据。因此需要单独在模拟数据集中加上缺失小时的记录,flag=0,不影响最高同时在线人数的统计。
处理方式:
a)新建一张小时维表,dim_time,取2020-08-01 每个小时,建表语句如下:
dim_time as (
select date_format(
from_unixtime(unix_timestamp(start_date) + pos * 3600),
'yyyy-MM-dd HH:mm:ss'
) as time_hour
from(
select '1' as uid,
'2020-08-01 00:00:00' as start_date,
'2020-08-01 23:00:00' as end_date
) tmp lateral view posexplode(
split(
space(
hour(end_date) - hour(start_date) +(
datediff(
end_date,
start_date
)
) * 24
),
''
)
) t as pos,
val
)
b)关联维表找到模拟数据集中缺失的时间段,把数据补齐
rs_test as (
select 'A' AS uid,
time_hour AS event_time,
0 AS flag
from dim_time t1
left join test_tab t2 --
on substr(time_hour, 1, 13) = substr(event_time, 1, 13)
where event_time is null
union ALL
select uid,
event_time,
case
when event_type = '1' then 1
when event_type = '2' then -1
else 0
end flag
from test_tab
)
3、计算每小时最高同时在线人数
代码:
select date(event_time) as login_date,
hour(event_time) as login_hour,
max(online_user_cnt) as online_user_cnt_max
from (
select uid,
event_time,
sum(flag) over (
order by event_time
) as online_user_cnt
from rs_test
) b
group by date(event_time),
hour(event_time)
结果:
4 、计算当天最高同时在线人数
代码:
select date(event_time) as login_date,
max(online_user_cnt) as online_user_cnt_max
from (
select uid,
event_time,
sum(flag) over (
order by event_time
) as online_user_cnt
from rs_test
) b
group by date(event_time)
结果:
5、计算当天最高同时在线人数的时间段
代码
SELECT
max_cur_cnt,
event_time AS start_time,
lead_event_time AS end_time
FROM
(
SELECT
uid,
event_time,
flag,
cur_cnt,
max_cur_cnt,
lead ( event_time, 1, event_time ) OVER ( ORDER BY event_time ) lead_event_time
FROM
(
SELECT
uid,
event_time,
flag,
cur_cnt,
max( cur_cnt ) OVER ( ) AS max_cur_cnt
FROM
(
SELECT
uid,
event_time,
flag,
sum( flag ) OVER ( ORDER BY event_time ) AS cur_cnt
FROM rs_test t
) m
) n --求出最大在线人数
) p --lead函数求出在线人数最大的时间段(lead向后取N,lag向前取N)
WHERE
CUR_CNT = MAX_CUR_CNT
结果: