问题背景:
给出一张流水表A,字段:uid,date(时间戳、流水中每次的登录时间),day(每天分区,格式yyyymmdd),增量表,以update_time分区),希望输出以下表格,其中,横坐标和纵坐标自动更新每天的留存情况
日期 | 新增用户数 | 次日留存用户数 | 三日留存用户数 | 四日留存用户数 | ...... |
2021-3-1 | |||||
2021-3-2 | |||||
...... |
步骤:
确认口径:次N日所有用户留存率 当日登录的用户数且在第N日仍然登录用户数 / 当日登录用户数;N:1,2,3......;
第一步:求当日新增用户数
考虑的问题: 数据量的问题,将date拆分为两个时间字段,create_time(第一次登录时间,只记录一次,10位)、update_time(当日最后一次登录时间)
创建中间表登录表log:字段:uid,create_time,update_time,day
today=$(date -d +"%Y%m%d") -- 今天
today_stamp=$(date -d "$today" +%s) --今天的零点时间戳
yesterday=$(date -d "1 day ago" +"%Y%m%d") -- 昨天
yesterday_stamp=$(date -d "$yesterday" +%s) --昨天的零点时间戳
sql="insert into table log_frist partition (day=${yesterday}) --插入昨天分区的新增用户的数据
select uid,min(date) as create_time,max(date) as update_time,day
from A
where day= '${yesterday}'
and uid not in (
select uid
from log_frist
day = date_sub(to_date('${yesterday}'),1)
)
group by 1,2,3
having by min(date)>= '${yesterday_stamp}' and min(date)<= '${today_stamp}'
;
"
第二步:计算当日活跃的用户, 因为是流水表,每次登录都会被记录,count时需要去重 distinct
select
from ()
按照固定的1日留存,3日留存,7日留存,14日留存
SELECT
m.thedate
,m.uv
,m.d1_uv
,ROUND(m.d1_uv/m.uv,4) AS d1_retention_rate
,m.d3_uv
,ROUND(m.d3_uv/m.uv,4) AS d3_retention_rate
,m.d7_uv
,ROUND(m.d7_uv/m.uv,4) AS d7_retention_rate
,m.d14_uv
,ROUND(m.d14_uv/m.uv,4) AS d14_retention_rate
,m.d30_uv
,ROUND(m.d30_uv/m.uv,4) AS d30_retention_rate
FROM (
SELECT
dt AS thedate
,COUNT(DISTINCT t.user_id) AS uv
,COUNT(DISTINCT CASE WHEN dt = d1 THEN t.user_id END) d1_uv
,COUNT(DISTINCT CASE WHEN dt = d3 THEN t.user_id END) d3_uv
,COUNT(DISTINCT CASE WHEN dt = d7 THEN t.user_id END) d7_uv
,COUNT(DISTINCT CASE WHEN dt = d14 THEN t.user_id END) d14_uv
,COUNT(DISTINCT CASE WHEN dt = d30 THEN t.user_id END) d30_uv
FROM (
SELECT
TO_DATE(TO_CHAR(FROM_UNIXTIME(paytime),'yyyymmdd'),'yyyymmdd') AS dt
,buyeruserid AS user_id
FROM log.dwd_collection_exchange_trade_a_d
WHERE dt=MAX_PT('log.dwd_collection_exchange_trade_a_d')
AND paytime>0
and TO_CHAR(FROM_UNIXTIME(paytime),'yyyymmdd')>='20210528'
and TO_CHAR(FROM_UNIXTIME(paytime),'yyyymmdd')<=${bdp.system.bizdate}
) t
LEFT JOIN (
SELECT
dt AS ds
,user_id AS userid
,DATEADD(TO_DATE(dt,'yyyymmdd'),-1,'dd') d1
,DATEADD(TO_DATE(dt,'yyyymmdd'),-3,'dd') d3
,DATEADD(TO_DATE(dt,'yyyymmdd'),-7,'dd') d7
,DATEADD(TO_DATE(dt,'yyyymmdd'),-14,'dd') d14
,DATEADD(TO_DATE(dt,'yyyymmdd'),-30,'dd') d30
FROM log.dwd_collection_exchange_pageshow_i_d
WHERE dt>='20210528'
AND dt<=${bdp.system.bizdate}
AND user_id IS NOT NULL
GROUP BY dt
,user_id
,DATEADD(TO_DATE(dt,'yyyymmdd'),-1,'dd')
,DATEADD(TO_DATE(dt,'yyyymmdd'),-3,'dd')
,DATEADD(TO_DATE(dt,'yyyymmdd'),-7,'dd')
,DATEADD(TO_DATE(dt,'yyyymmdd'),-14,'dd')
,DATEADD(TO_DATE(dt,'yyyymmdd'),-30,'dd')
)v ON v.userid = t.user_id AND t.dt>=v.d30 AND t.dt<=v.d1
GROUP BY dt
) m
按照每日,每月留存,
中间表:
INSERT OVERWRITE TABLE mid_user_for_report_monthvisit PARTITION(pt)
select
u1.user_id,u1.month_pt,u.risk_type
,u.pt
from(
-- 当月最后一日用户是否风控
SELECT
user_id,risk_type
,substr(pt,1,6) month_pt
,pt
from mid_user_for_report
where pt = '${bdp.system.bizdate}'
and to_char(dateadd(to_date('${bdp.system.bizdate}','yyyymmdd'),1,'dd'),'yyyymmdd') like '%01'
-- and risk_type <> '风控'
)u
join(
select
-- 当月访问
SUBSTR(pt,1,6) month_pt,user_id
from mid_user_for_report
where SUBSTR(pt,1,6) = SUBSTR('${bdp.system.bizdate}',1,6)
and is_visit = 1
group by SUBSTR(pt,1,6) ,user_id
)u1 on u1.user_id = u.user_id and u1.month_pt = u.month_pt
;
-- drop table if exists ads_user_for_report_monthvisit;
create table if not exists ads_user_for_report_monthvisit (
month_pt string
,diffmonth int
,rv_uv int
)PARTITIONED by (ds string )
;
最终报表:
INSERT OVERWRITE TABLE ads_user_for_report_monthvisit PARTITION(ds)
SELECT
t.month_pt
,DATEDIFF(TO_DATE(t1.month_pt,'yyyymm'),TO_DATE(t.month_pt,'yyyymm'),'mm') diffmonth
-- ,count(t.user_id) uv
,count(t1.user_id) rv_uv
,concat(t.month_pt,'-',DATEDIFF(TO_DATE(t1.month_pt,'yyyymm'),TO_DATE(t.month_pt,'yyyymm'),'mm'))
from (
--
SELECT
*
from mid_user_for_report_monthvisit
where pt <= '${bdp.system.bizdate}'
and risk_type <> '风控'
)t
left join (
select
user_id,month_pt
from mid_user_for_report_monthvisit
where pt = '${bdp.system.bizdate}'
)t1 on t.user_id = t1.user_id
where DATEDIFF(TO_DATE(t1.month_pt,'yyyymm'),TO_DATE(t.month_pt,'yyyymm'),'mm') is not null
group by t.month_pt ,DATEDIFF(TO_DATE(t1.month_pt,'yyyymm'),TO_DATE(t.month_pt,'yyyymm'),'mm')
,concat(t.month_pt,'-',DATEDIFF(TO_DATE(t1.month_pt,'yyyymm'),TO_DATE(t.month_pt,'yyyymm'),'mm'))