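Time on Page can be understood simply as the time difference between two consecutive page views by the same user. Each time a page opens, the front end only needs to report a log entry in a fixed format, and we can then derive the time on page from those logs. For example, suppose the log format looks like this: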
{"ip":"101.82.208.128","event_type":"show","user_id":null,"terminal":"H5_WEIXIN","lang":"zh-cn","ua":"Mozilla/5.0 (Linux; Android 11; M2011K2C Build/RKQ1.200928.002; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/89.0.4389.72 MQQBrowser/6.2 TBS/045811 Mobile Safari/537.36 MMWEBID/5037 MicroMessenger/8.0.15.2020(0x28000F3D) Process/tools WeChat/arm64 Weixin NetType/5G Language/zh_CN ABI/arm64","event_time":1635563210.0,"path":"/pages/invitation/exhibitor?id=4881&_share_from_uid=1756&code=0217I2Ga16pJ1C0nMUHa1Gfo1z27I2GQ&state=1635563193","ext":{},"log_id":"6cd01da0-392e-11ec-a49e-7b19f3c9f86f","browser_id":"d32d82002348e054b8336436ee21321a","@log_name":"oplog","created_at":1635563209.0}
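Before any Spark processing, it can help to see which fields of such a record matter for this calculation. The following is a minimal plain-Python sketch, not part of the original job: `parse_log_line` is a hypothetical helper, the record is a shortened copy of the sample above, and treating `page` as the path without its query string is an assumption.

```python
import json
from urllib.parse import urlsplit, parse_qs


def parse_log_line(line):
    """Parse one JSON log line and keep the fields used for time on page."""
    record = json.loads(line)
    parts = urlsplit(record["path"])
    query = parse_qs(parts.query)
    return {
        "user_id": record.get("user_id"),
        "event_time": float(record["event_time"]),
        # Assumption: the page is the path without its query string.
        "page": parts.path,
        # parse_qs returns lists; keep the first value of each parameter.
        "query": {key: values[0] for key, values in query.items()},
    }


# Shortened copy of the sample record (most fields and query parameters omitted).
sample = (
    '{"event_type": "show", "user_id": null, "event_time": 1635563210.0, '
    '"path": "/pages/invitation/exhibitor?id=4881&_share_from_uid=1756"}'
)
parsed = parse_log_line(sample)
print(parsed)
```

Note that `user_id` is null in this record, which is why the Spark code filters out rows without a user ID before computing anything.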
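1 In Spark, define a window that partitions by user ID and orders each user's logs by event timestamp. The order is ascending so that lead() in the next step returns the timestamp of the page the user opened next:
from pyspark.sql import Window, functions
session_window = Window.partitionBy("user_id").orderBy(functions.col("event_time").asc())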
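2 Use functions.lead to create a new column that holds the timestamp of the following row in the window:
# Assumption: page and sponsor_id are not in the raw log shown above and were derived upstream (for example from path).
diff_df = df.withColumn("pre_time", functions.lead("event_time", 1).over(session_window))\
    .filter("user_id is not null and user_id != ''")\
    .select("page", "sponsor_id", "user_id", "event_time", "pre_time")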
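3 Compute the timestamp difference between pages, that is, the later page's timestamp minus the earlier page's timestamp. Filter out invalid rows, then sum the time per group:
seconds_df = diff_df.withColumn("stay_time", functions.col("pre_time") - functions.col("event_time"))\
    .filter("stay_time is not null and stay_time != 0")\
    .filter("stay_time < 1800")\
    .groupBy("page", "sponsor_id", "user_id")\
    .agg(functions.sum("stay_time").alias("stay_time"))\
    .select("page", "sponsor_id", "user_id", "stay_time")
# show() returns None, so call it separately and keep seconds_df as a DataFrame.
seconds_df.show(truncate=False)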
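To make the arithmetic concrete, here is a small plain-Python sketch of the same idea without Spark: sort each user's page views by time, pair every view with the next one, drop gaps of 1800 seconds or more, and sum what remains per page. The function name `stay_time_per_page` and the event data are hypothetical and purely illustrative, and grouping is simplified to (user, page).

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 1800  # same cutoff as the Spark filter


def stay_time_per_page(events):
    """events: iterable of (user_id, page, event_time) tuples.

    Returns {(user_id, page): total seconds spent on that page}.
    """
    by_user = defaultdict(list)
    for user_id, page, event_time in events:
        if user_id:  # skip null or empty user IDs
            by_user[user_id].append((event_time, page))

    totals = defaultdict(float)
    for user_id, views in by_user.items():
        views.sort()  # ascending by event_time
        # Pair each view with the following one, like lead(event_time, 1).
        for (t, page), (t_next, _) in zip(views, views[1:]):
            gap = t_next - t
            if 0 < gap < SESSION_GAP_SECONDS:
                totals[(user_id, page)] += gap
    return dict(totals)


# Made-up events for illustration only.
events = [
    ("u1", "/home", 0.0),
    ("u1", "/list", 30.0),
    ("u1", "/detail", 90.0),
    ("u1", "/home", 4000.0),  # more than 1800 s later: treated as a new session
    ("u1", "/list", 4010.0),
    (None, "/home", 5.0),     # dropped: no user ID
]
result = stay_time_per_page(events)
print(result)
```

Here `result` is `{("u1", "/home"): 40.0, ("u1", "/list"): 60.0}`. The `/detail` view contributes nothing because the next view comes 3910 seconds later, and the final `/list` view contributes nothing because no view follows it.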
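Note that the last page a user visits in a session cannot be measured, because there is no way to know when the user left. For that reason I assume here that a gap of more than half an hour (1800 seconds) between two page views marks the start of a new session, so that gap is not counted. Result: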