Computing page dwell time usually means running a Spark or MapReduce job. If it can be done in plain SQL, so much the better (SQL is the best language, after all). Enough talk, let's get started.
Note: the query engine used here is Presto.
First, create a table of user page-view records. The schema is simple, just 3 fields: uid (user id), page (page URL), and time (the timestamp of entering the page). If we also had the time the user left each page, dwell time would simply be leave time minus enter time. That would be nice, but in practice page tracking usually only logs when a user enters a page, not when they leave. So the dwell time on the current page has to be computed as the time the user enters the next page minus the time they entered the current one, and that is exactly what makes this tricky. Today we'll implement it in SQL.
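The core trick, "a page's leave time is the enter time of the same user's next page", can also be sketched directly with Presto's lead() window function. This is just a sketch of the idea, not the approach taken below; the array-based version that follows produces the same per-page numbers and exposes the intermediate tables along the way:

```sql
-- sketch: leave time of a page = enter time of the user's next page;
-- the last page has no next page, so coalesce falls back to its own
-- enter time and the dwell time comes out as 0
select uid,
       page,
       time as starttime,
       coalesce(lead(time) over (partition by uid order by time), time) - time as pagetime
from hadoop.wh.site_flow_test
```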
CREATE TABLE hadoop.wh.site_flow_test (
uid varchar,
page varchar,
time bigint )
WITH ( format = 'ORC' )
Test data (uid, page, time):
userid1,index,1539242423
userid1,index,1539243421
userid1,page,1539243421
userid1,detail,1539213421
userid1,detail,1539223421
userid1,page,1539243121
userid1,detail,1539253421
userid2,index,1539241423
userid2,index,1539243221
userid2,page,1539243411
userid2,detail,1539242421
userid2,detail,1539240421
userid2,page,1539243221
userid2,detail,1539213421
Step 1: aggregate each user's page views into the format uid, [page1, page2], [time1, time2]. This means the user entered page1 at time1 and page2 at time2, with the times sorted in ascending order. The implementation:
create table hadoop.wh.site_flow_tmp (uid, pages, arrs)
as
select uid,
       -- order inside the aggregation (supported in newer Presto) so both arrays line up by time
       array_agg(page order by time),
       array_agg(time order by time)
from hadoop.wh.site_flow_test
group by uid
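One caveat worth flagging: in a distributed engine like Presto, an ORDER BY inside a CTE or subquery is not guaranteed to be preserved through a later GROUP BY, so array_agg may see rows in any order. Presto versions that accept an ORDER BY inside the aggregation itself let you enforce the order where it matters. A tiny self-contained check of the semantics:

```sql
-- order the values inside the aggregation itself
select array_agg(page order by time) as pages
from (values ('b', 2), ('a', 1), ('c', 3)) as t(page, time)
-- expected: [a, b, c]
```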
Result:
uid: userid2
  pages: [detail, detail, index, detail, index, page, page]
  arrs:  [1539213421, 1539240421, 1539241423, 1539242421, 1539243221, 1539243221, 1539243411]
uid: userid1
  pages: [detail, detail, index, page, index, page, detail]
  arrs:  [1539213421, 1539223421, 1539242423, 1539243121, 1539243421, 1539243421, 1539253421]
Step 2: build a start-time array and an end-time array. starts[i] is the time the user entered pages[i], and ends[i] is the time they left it. userid2 viewed 7 pages; the last page has no following page to borrow a leave time from, so its leave time is set equal to its enter time, and its computed dwell time will come out as 0.
create table hadoop.wh.site_flow_tmp2 (uid, pages, starts, ends) as
select uid,
       pages,
       slice(arrs, 1, cardinality(arrs)) starts,
       concat(slice(arrs, 2, cardinality(arrs)),
              slice(arrs, cardinality(arrs), 1)) ends
from hadoop.wh.site_flow_tmp
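The slice(array, start, length) and concat(array, ...) calls do the shifting here: dropping the first element shifts every enter time one slot to the left, and repeating the last element pads the final page. A minimal standalone illustration with a 3-element array (any length beyond the remaining elements is simply capped by slice):

```sql
-- slice(a, 2, cardinality(a)) drops the first element,
-- slice(a, cardinality(a), 1) repeats the last one
select concat(slice(a, 2, cardinality(a)), slice(a, cardinality(a), 1)) as ends
from (values (array[10, 20, 30])) as t(a)
-- expected: [20, 30, 30]
```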
uid: userid2
  pages:  [detail, detail, index, detail, index, page, page]
  starts: [1539213421, 1539240421, 1539241423, 1539242421, 1539243221, 1539243221, 1539243411]
  ends:   [1539240421, 1539241423, 1539242421, 1539243221, 1539243221, 1539243411, 1539243411]
uid: userid1
  pages:  [detail, detail, index, page, index, page, detail]
  starts: [1539213421, 1539223421, 1539242423, 1539243121, 1539243421, 1539243421, 1539253421]
  ends:   [1539223421, 1539242423, 1539243121, 1539243421, 1539243421, 1539253421, 1539253421]
Step 3: subtracting each enter time from the matching leave time gives the per-page dwell time:
create table hadoop.wh.site_flow_tmp3 (uid, pages, starts, result) as
select uid, pages, starts, zip_with(starts, ends, (x, y) -> y - x)
from hadoop.wh.site_flow_tmp2
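zip_with walks two arrays of equal length in lockstep and applies the lambda to each pair of elements, which is exactly the element-wise subtraction needed here. A standalone example:

```sql
select zip_with(array[1, 3], array[5, 10], (x, y) -> y - x)
-- expected: [4, 7]
```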
uid: userid2
  pages:  [detail, detail, index, detail, index, page, page]
  starts: [1539213421, 1539240421, 1539241423, 1539242421, 1539243221, 1539243221, 1539243411]
  result: [27000, 1002, 998, 800, 0, 190, 0]
uid: userid1
  pages:  [detail, detail, index, page, index, page, detail]
  starts: [1539213421, 1539223421, 1539242423, 1539243121, 1539243421, 1539243421, 1539253421]
  result: [10000, 19002, 698, 300, 0, 10000, 0]
Finally, flatten the arrays back into one row per page view, and we're done:
with a as (select uid, pages, starts, result from hadoop.wh.site_flow_tmp3)
select uid, page, starttime, pagetime
from a cross join unnest(pages, starts, result) t (page, starttime, pagetime)
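CROSS JOIN UNNEST over several same-length arrays zips them positionally into parallel columns, one output row per array index. A minimal standalone example:

```sql
select page, t
from (values (array['a', 'b'], array[1, 2])) as v(pages, times)
cross join unnest(pages, times) as u(page, t)
-- expected rows: (a, 1), (b, 2)
```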
uid page starttime pagetime
userid2 detail 1539213421 27000
userid2 detail 1539240421 1002
userid2 index 1539241423 998
userid2 detail 1539242421 800
userid2 index 1539243221 0
userid2 page 1539243221 190
userid2 page 1539243411 0
userid1 detail 1539213421 10000
userid1 detail 1539223421 19002
userid1 index 1539242423 698
userid1 page 1539243121 300
userid1 index 1539243421 0
userid1 page 1539243421 10000
userid1 detail 1539253421 0