计算页面停留时长的另类方式

计算页面停留时长,往常计算也页面留时长都需要跑spark或这mr任务来实现,如果能跑sql来实现那是最好不过了(sql是最好的语言),废话不多说,直接搞起

注意:这里使用的计算引擎是presto

首先来创建一张用户浏览记录数据表,表结构很简单只有3个字段,uid 用户id,page 页面链接,time 进入此页面的时间戳;如果有页面离开时间,那离开时间-进入时间就是页面停留时间;想的是挺好,可惜通常情况下,页面打点只记录进入页面的时间,而不会记录页面离开时间;要计算页面停留时间就需要通过用户进入下个页面nextpage的时间减去当前页面curpage的时间这种方式计算,这也是计算页面停留时间的难点;今天我们就用sql实现这个功能

CREATE TABLE hadoop.wh.site_flow_test ( 
uid varchar, 
page varchar, 
time long ) 
WITH ( format = 'ORC' )

测试数据

userid1,index,1539242423
userid1,index,1539243421
userid1,page,1539243421
userid1,detail,1539213421
userid1,detail,1539223421
userid1,page,1539243121
userid1,detail,1539253421
userid2,index,1539241423
userid2,index,1539243221
userid2,page,1539243411
userid2,detail,1539242421
userid2,detail,1539240421
userid2,page,1539243221
userid2,detail,1539213421

第一步对每个用的浏览的页面数据聚合格式如下:uid,[page1,page2],[time1,time2];
这份数据的意思是用户进入page1页面的时间是time1,其中time1和time2是升序排列,具体实现如下

create table hadoop.wh.site_flow_tmp (uid,pages,arrs)
as
with a as (select uid,page,time from hadoop.wh.site_flow_test order by time  )
select uid,array_agg(page),array_agg(time) from a group by uid
结果如下

uid▼pages▼arrs▼

userid2	PrestoArray{ array=[detail, detail, index, detail, index, page, page]}	PrestoArray{ array=[1.539213421E9, 1.539240421E9, 1.539241423E9, 1.539242421E9, 1.539243221E9, 1.539243221E9, 1.539243411E9]}
userid1	PrestoArray{ array=[detail, detail, index, page, index, page, detail]}	PrestoArray{ array=[1.539213421E9, 1.539223421E9, 1.539242423E9, 1.539243121E9, 1.539243421E9, 1.539243421E9, 1.539253421E9]}

构建开始时间数组和结束时间数组,starts字段是指进入对应pages下标页面的时间,ends是对应pages下表的离开时间;userid2用户浏览了7个页面,最后一个页面的离开时间没办法关联,所以最后一个页面的离开时间就是进入时间,计算页面停留时长的使用就会变成0

create table hadoop.wh.site_flow_tmp2 (uid,pages,starts,ends) as
select uid,pages,slice(arrs,1,cardinality(arrs)) starts ,concat(slice(arrs,2,cardinality(arrs)),slice(arrs,cardinality(arrs),1)) ends
from hadoop.wh.site_flow_tmp 


uid▼pages▼starts▼ends▼
userid2	
PrestoArray{ array=[detail, detail, index, detail, index, page, page]}	
PrestoArray{ array=[1.539213421E9, 1.539240421E9, 1.539241423E9, 1.539242421E9, 1.539243221E9, 1.539243221E9, 1.539243411E9]}	
PrestoArray{ array=[1.539240421E9, 1.539241423E9, 1.539242421E9, 1.539243221E9, 1.539243221E9, 1.539243411E9, 1.539243411E9]}
userid1	
PrestoArray{ array=[detail, detail, index, page, index, page, detail]}	
PrestoArray{ array=[1.539213421E9, 1.539223421E9, 1.539242423E9, 1.539243121E9, 1.539243421E9, 1.539243421E9, 1.539253421E9]}	PrestoArray{ array=[1.539223421E9, 1.539242423E9, 1.539243121E9, 1.539243421E9, 1.539243421E9, 1.539253421E9, 1.539253421E9]}




用离开时间减去进入时间即可得到页面停留时间

create table hadoop.wh.site_flow_tmp3(uid,pages,starts,result) as
SELECT uid,pages,starts,zip_with(starts, ends, (x, y) -> y- x) from hadoop.wh.site_flow_tmp2 


uid▼pages▼starts▼result▼
userid2	
PrestoArray{ array=[detail, detail, index, detail, index, page, page]}	
PrestoArray{ array=[1.539213421E9, 1.539240421E9, 1.539241423E9, 1.539242421E9, 1.539243221E9, 1.539243221E9, 1.539243411E9]}	
PrestoArray{ array=[27000.0, 1002.0, 998.0, 800.0, 0.0, 190.0, 0.0]}
userid1	
PrestoArray{ array=[detail, detail, index, page, index, page, detail]}	
PrestoArray{ array=[1.539213421E9, 1.539223421E9, 1.539242423E9, 1.539243121E9, 1.539243421E9, 1.539243421E9, 1.539253421E9]}	
PrestoArray{ array=[10000.0, 19002.0, 698.0, 300.0, 0.0, 10000.0, 0.0]}



数据平铺,搞定

 with a as (SELECT uid,pages,starts,result from hadoop.wh.site_flow_tmp3 )
 select uid,page,starttime,pagetime from a CROSS JOIN UNNEST(pages,starts,result) t (page,starttime,pagetime)
 

uid▼page▼starttime▼pagetime▼
userid2	detail	1539213421	27000
userid2	detail	1539240421	1002
userid2	index	1539241423	998
userid2	detail	1539242421	800
userid2	index	1539243221	0
userid2	page	1539243221	190
userid2	page	1539243411	0
userid1	detail	1539213421	10000
userid1	detail	1539223421	19002
userid1	index	1539242423	698
userid1	page	1539243121	300
userid1	index	1539243421	0
userid1	page	1539243421	10000
userid1	detail	1539253421	0
  • 1
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值