计算每日某个指标留存,实现定时脚本
以某产品下载用户留存为例,计算次,3,4,5,6,7,14,30日留存
最终结果显示如下图:
思路解析
留存定义为在当天做某操作的用户在相对于当天的下一天(次留存),下两天(3留存)…该用户依旧做这个操作;这里举例的是下载,即当天下载的用户,在相对于当天的下一天(次留存),下两天(3留存)依旧有下载行为;但是因为要实现每天的定时脚本跑当天的后续事件留存。按照普通思路,以当天日期为基础,计算相对于当天的留存显然实现不了(因为数据的t+1更新,当天的数据为最新数据,当天的后续数据现在还没有),所以应该以当天时间为以前某天的时间的留存时间,例如今天是昨天的次留存计算时间,今天是前天的3留存时间,以此类推,利用已有的数据计算。
表结构介绍
此处以下载为例,下载表中字段如下:
字段名 | 类型 | 描述 |
---|---|---|
sourceid | bigint | 资源自增id |
username | string | 下载者用户名 |
dateline | bigint | 下载时间戳 |
代码
with tmp as (
SELECT
tolower(username) as username,
substr(from_unixtime(dateline),1,10) as pdate
FROM
download_source_userdown_source_base
)
insert into table dl_download_user_remain
SELECT
'${pt}' as pt,
'今日下载用户' as itype,
count(distinct username) as users
FROM tmp
WHERE pdate = '${pt}'
union all
SELECT
'${pt2}' as pt,
'次留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt2}') b
on a.username = b.username
union all
SELECT
'${pt3}' as pt,
'3留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt3}') b
on a.username = b.username
union all
SELECT
'${pt4}' as pt,
'4留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt4}') b
on a.username = b.username
union all
SELECT
'${pt5}' as pt,
'5留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt5}') b
on a.username = b.username
union all
SELECT
'${pt6}' as pt,
'6留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt6}') b
on a.username = b.username
union all
SELECT
'${pt7}' as pt,
'7留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt7}') b
on a.username = b.username
union all
SELECT
'${pt14}' as pt,
'14留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt14}') b
on a.username = b.username
union all
SELECT
'${pt30}' as pt,
'30留下载用户' as itype,
count(distinct b.username) as users
FROM
(SELECT username FROM tmp WHERE pdate = '${pt}') a
left JOIN
(SELECT username FROM tmp WHERE pdate = '${pt30}') b
on a.username = b.username
代码释义
with as临时表中先将下载表中的下载时间和下载用户做转化处理,每个union all 都是当天(在代码中变量值为’${pt}’)作为之前天数的对应的留存,’${pt2}’,’${pt3}’,’${pt4}’,’${pt5}’,’${pt6}’,’${pt7}’,’${pt14}’,’${pt30}‘为相对于今天减去后缀数字-1的天数日期 (例如:’${pt2}‘是’${pt}’-1天的日期,即当天的前一天日期,其他以此类推) ;dl_download_user_remain作为结果数据的存储表,改表中字段如下:
字段 | 字段类型 | 字段释义 |
---|---|---|
pdate | string | 时间 |
itype | string | 区分是什么用户数(当天的下载用户数/某天的留存用户数) |
users | bigint | 用户数 |
结果数据表中存在多条某一天的不同数据(此处的不同指的是itype区别的某天的留存),部分结果表数据如图:
展示阶段
到这数据已经准备好了,现在需要做的是将这些数据做最后的计算。这里用于展示的平台是阿里的qiuck bi,采用其即席分析sql创建数据集。代码如下:
SELECT
a.pdate,
a.itype,
a.users,
cast(a.users as float)/cast(b.users as float)
FROM
dl_download_user_remain a
left join
(select * from dl_download_user_remain where itype = '今日下载用户') b
on a.pdate = b.pdate
这里做的是将准备好的数据以日期为关联条件自连接将所有日期相同的数据拉平。且用留存用户数除以对应日期的当天下载用户数,以计算留存率。
总结
计算留存的难点在于数据的时效性有限。不能计算当天在以后的时间中的留存,而是将当天作为之前时间的留存天计算。