guid和first_dt,rng_start很好解决,
rng_end需要判断.
昨天不在今天在,那么保留老记录,新添加一行 ----新添加这一行是09 05 -9999-12-21
昨天在今天也在 那么嗨是保留记录
昨天不在今天也不在保留记录 --保留原纪录)
昨天在,今天不在 =T-1
新用户
将二表full join在一起,
这个时候有以下情况
情况1:前日不在,今在,(老记录要保留,新添一行)--缺
情况2:前日不在,今不在(保留原纪录)
情况3:前日在,今不在(修改此人的最后区间-a.dt)
情况4:前日在,今在(保留原纪录)
情况5:新用户,(添加记录)
select
nvl(a.guid,b.guid) as guid,
nvl(a.first_dt,b.dt) as first_dt,
nvl(a.rng_start,b.dt) as rng_start,
case when a.rng_end='9999-12-31' and b.dt is null then a.dt
when a.rng_end is null then '9999-12-31'
else a.rng_end end as rng_end
from
use_test4 a
full join use_test5 b on a.guid=b.guid and a.dt='2020-09-04' AND b.dt='2020-09-05'
select
a.guid,
a.first_dt,
b.dt,
'9999-12-31' as rng_end
from
(
select
guid,
first_dt
from
use_test4
where dt='2020-09-04'
group by guid ,first_dt having max(rng_end != '9999-12-31')
) a join
use_test5 b on a.guid=b.guid and b.dt='2020-09-05'
昨日没有登录,而今天登录的用户,
单独求出guid 嗯rng_end!=9999-12-31
create table doit.use_test4(
guid string ,
first_dt string,
rng_start string,
rng_end string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
;
load data local inpath '/root/usetext4.txt' into table doit.use_test4 PARTITION(dt='2020-09-04') ;
CREATE TABLE doit.use_test5(
guid string
)
PARTITIONED BY (DT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
;
LOAD DATA LOCAL INPATH '/root/usetext5.txt' INTO TABLE doit.use_test5 PARTITION(dt='2020-09-05');
如何统计新老用户留存
有以下几种情况
起始日期,留存起始日期,留存天数和留存人数,
拆开来看
选算出
每个guid,first_dt,rng_start,rng_end 的次数
这里分为以下几种情况 ,我采用如果昨天登录了就给它个9999-12-31天数,
有以下几种情况.
昨天来了,今天也来了,那么其实日期就是start_dt 到9999-12-31天数
昨天来了,今天没来,那么就给它昨天的天数
昨天没来,今天也没来 保持不变 因为之前就已经给了他天数,上一个情况
昨天没来,今天来了,这种情况就给它9999
1. 用区间记录表的T-1日 FULL JOIN 日活T日
得到结果part1:
历史记录保留
新增用户新增
之前没封闭的区间今日封闭(今日没来)
2. 从区间记录表的T-1日中过滤出所有昨天没活跃的人, JOIN T日的日
得到结果part2
这些人应该新增的行
select
a.first_dt,
b.*
from
use_test4 a
full join
use_test5 b
on a.dt='2020-09-04' and b.dt='2020-09-05' and a.guid=b.guid
select
nvl(a.guid,b.guid) as guid,
nvl(a.first_dt,b.dt) as first_dt,
nvl(rng_start,b.dt) as rng_end,
case when a.rng_end='9999-12-31' and b.dt is null then a.dt
when a.rng_end is null then '9999-12-31'
else a.rng_end
end as rng_end
from
use_test4 a
full join
use_test5 b
on a.dt='2020-09-04' and b.dt='2020-09-05' and a.guid=b.guid
UNION ALL
select
a.guid,
a.first_dt,
b.dt,
'9999-12-31' as rng_end
from
(
select
guid,
first_dt
from
use_test4
where dt='2020-09-04' group by guid,first_dt having max(rng_end)!='9999-12-31'
) a
join
use_test5 b on b.dt='2020-09-05' and a.guid=b.guid
区间记录表:4号
g01,2020-09-01,2020-09-01,2020-09-01
g01,2020-09-01,2020-09-03,2020-09-03
g02,2020-09-01,2020-09-01,2020-09-02
g03,2020-09-01,2020-09-01,9999-12-31
g04,2020-09-02,2020-09-02,2020-09-02
g05,2020-09-02,2020-09-02,2020-09-03
g06,2020-09-02,2020-09-02,9999-12-31
g07,2020-09-03,2020-09-03,2020-09-03
g08,2020-09-03,2020-09-03,9999-12-31
日活表:5号
g01,2020-09-05
g03,2020-09-05
g11,2020-09-05
g12,2020-09-05
CREATE TABLE dws.app_user_retention(
dw_dt string,
new_dt string, -- 新增日
ret_days int, 留存天数
ret_amt int
)
STORED AS PARQUET
;
select
guid,
'2020-09-03' as dw_dt,
first_dt as new_dt,
datediff('2020-09-03',first_dt) as ret_days,
count(if(rng_end='9999-12-31',1,null)) as ret_amt
from
dws.app_user_act_rng where dt='2020-09-01' and datediff('2020-09-03',first_dt) <30
group by first_dt ,guid