笔者在求职数据分析师岗位时,在知乎上看到了一道非常经典的SQL题,问题来源: link。笔者认为彻底的搞懂这道题以及涉及的知识点,能够通过绝大部分数据分析岗位的SQL笔试。下面是这道题的问题及解答,答案为笔者原创,代码均在oracle数据库调试通过,但可能有考虑不周全的地方或其他解法,欢迎大家在评论区交流。
表结构介绍
1、用户活跃表:tmp_liujg_dau_based
1)imp_date,日期,string格式,样例20190601
2)qimei,用户唯一标识,string格式,无空值
3)is_new,新用户标识,string,1表示新用户,0表示老用户
说明:以imp_date、qimei为主键,一个用户1天只出现1次,出现即表示当日登陆
2、红包活动参与表:tmp_liujg_packet_based
1)imp_date,日期,string格式,样例20190601
2)report_time,领取时间戳,string格式
3)qimei,用户唯一标识,string格式,无空值
4)add_money,领取金额,string格式,表示领取金额,单位为分,无空值或0值
说明:日志流水表,每一行为领取1次红包。无特殊情况说明,一般不考虑领取红包但当日未登录的情况
问题描述
第一问
计算20190601至今,每日DAU(活跃用户量,即有登陆的用户)
输出维度:imp_date(日期)
输出指标:dau
第二问
计算20190601至今,每日领取红包的新用户数,老用户数,及人均领取金额,人均领取次数
输出维度:imp_date(日期),is_new(新用户1,老用户0,未知2)
输出指标:mean_money(人均领取金额),mean_get_count(人均领取次数)
第三问
计算2019年3月,每个月按领红包取天数为1、2、3……30、31天区分,计算取每个月领取红包的用户数,人均领取金额,人均领取次数
输出维度:month(月份),get_money_das(领取天数)
输出指标:user_count(用户数),mean_money(人均领取金额),mean_get_count(人均领取次数)
第四问
计算2019年3月,每个月领过红包用户和未领红包用户的数量,平均月活跃天数(即本月平均活跃多少天)
输出维度:month(月份),is_packet_user(红包用户1,非红包用户0)
输出指标:user_count(用户数量),mean_days_in_month(月活跃天数)
第五问
计算2019年3月至今,每个月活跃用户的注册日期,2019年3月1日前注册的用户日期填空即可
输出维度:month(月份),qimei(用户唯一标识)
输出指标:register_date(注册日期)
第六问
计算2019年3月至今,每日的用户次日留存率,领取红包用户的次日留存,未领取红包用户的次日留存率
输出维度:imp_date(日期)
输出指标:dau(当日用户数),retain_rate(次日留存率)
,packet_user_retain_rate(当日领红包用户次日留存率),unpacket_user_retain_rate(当日未领红包用户次日留存率)
第七问
计算2019年3月1日至今,每日新用户领取的第一个红包的金额
输出维度:imp_date(日期),qimei(用户唯一标识)
输出指标:money(第一个红包的金额)
第八问
计算2019年3月1日至今,每个新用户领取的第一个红包和第二个红包的时间差(只计算注册当日有领取红包的用户,注册当日及以后的DAU表中新用户为1的用户)
输出维度:imp_date(注册日期),qimei(用户唯一标识)
输出指标:first_action_time(首次领红包时间),second_action_time(第二次领红包时间),delta_times(时间差)
第九问
计算2019年6月1日至今,每日领取红包用户领取金额的中位数
输出维度:imp_date(日期)
输出指标:mid_money(用户当日领取红包金额的中位数)
问题解答
建表
create table tmp_liujg_dau_based(-- 活跃用户表
imp_date varchar2(32),-- 日期
qimei varchar2(32),-- 用户唯一标识
is_new varchar2(16));-- 新用户标识:1新用户 0老用户
create table tmp_liujg_packed_based (-- 红包领取表
imp_date varchar2(32),-- 日期
report_time varchar2(32),-- 领取时间戳
qimei varchar2(32),-- 用户唯一标识
add_money varchar2(32));
插入数据
-- 插入用户表数据
insert all
into tmp_liujg_dau_based values('20190301','001','0')
into tmp_liujg_dau_based values('20190301','002','0')
into tmp_liujg_dau_based values('20190301','003','1')
into tmp_liujg_dau_based values('20190301','004','1')
into tmp_liujg_dau_based values('20190301','005','1')
into tmp_liujg_dau_based values('20190301','006','1')
into tmp_liujg_dau_based values('20190302','001','0')
into tmp_liujg_dau_based values('20190302','002','0')
into tmp_liujg_dau_based values('20190302','003','0')
into tmp_liujg_dau_based values('20190302','005','0')
into tmp_liujg_dau_based values('20190302','006','0')
into tmp_liujg_dau_based values('20190302','007','1')
into tmp_liujg_dau_based values('20190303','005','0')
into tmp_liujg_dau_based values('20190303','006','0')
into tmp_liujg_dau_based values('20190303','007','0')
into tmp_liujg_dau_based values('20190303','008','1')
into tmp_liujg_dau_based values('20190303','009','1')
into tmp_liujg_dau_based values('20190303','010','1')
into tmp_liujg_dau_based values('20190401','008','0')
into tmp_liujg_dau_based values('20190401','009','0')
into tmp_liujg_dau_based values('20190401','010','0')
into tmp_liujg_dau_based values('20190401','011','1')
into tmp_liujg_dau_based values('20190401','012','1')
into tmp_liujg_dau_based values('20190402','009','0')
into tmp_liujg_dau_based values('20190402','010','0')
into tmp_liujg_dau_based values('20190402','011','0')
into tmp_liujg_dau_based values('20190402','012','0')
into tmp_liujg_dau_based values('20190402','013','1')
into tmp_liujg_dau_based values('20190402','014','1')
into tmp_liujg_dau_based values('20190501','001','0')
into tmp_liujg_dau_based values('20190501','002','0')
into tmp_liujg_dau_based values('20190501','008','0')
into tmp_liujg_dau_based values('20190501','007','0')
into tmp_liujg_dau_based values('20190501','015','1')
into tmp_liujg_dau_based values('20190501','016','1')
into tmp_liujg_dau_based values('20190501','017','1')
into tmp_liujg_dau_based values('20190501','018','1')
into tmp_liujg_dau_based values('20190601','008','0')
into tmp_liujg_dau_based values('20190601','017','0')
into tmp_liujg_dau_based values('20190601','018','0')
into tmp_liujg_dau_based values('20190601','019','1')
into tmp_liujg_dau_based values('20190601','020','1')
into tmp_liujg_dau_based values('20190601','021','1')
into tmp_liujg_dau_based values('20190601','022','1')
into tmp_liujg_dau_based values('20190603','021','0')
into tmp_liujg_dau_based values('20190603','022','0')
into tmp_liujg_dau_based values('20190603','011','0')
into tmp_liujg_dau_based values('20190603','012','0')
into tmp_liujg_dau_based values('20190603','023','1')
into tmp_liujg_dau_based values('20190701','023','0')
into tmp_liujg_dau_based values('20190701','008','0')
into tmp_liujg_dau_based values('20190701','011','0')
into tmp_liujg_dau_based values('20190701','022','0')
into tmp_liujg_dau_based values('20190701','012','0')
into tmp_liujg_dau_based values('20190701','024','1')
into tmp_liujg_dau_based values('20190701','025','1')
into tmp_liujg_dau_based values('20190701','026','1')
into tmp_liujg_dau_based values('20190701','027','1')
into tmp_liujg_dau_based values('20190705','026','0')
into tmp_liujg_dau_based values('20190705','027','0')
into tmp_liujg_dau_based values('20190705','009','0')
into tmp_liujg_dau_based values('20190705','010','0')
into tmp_liujg_dau_based values('20190705','028','1')
into tmp_liujg_dau_based values('20190705','029','1')
select 1 from dual
-- 插入红包表数据
insert all
into tmp_liujg_packed_based values('20190301','2019/03/01 10:15:01','001','1.05')
into tmp_liujg_packed_based values('20190301','2019/03/01 13:15:01','001','2.30')
into tmp_liujg_packed_based values('20190301','2019/03/01 11:15:01','002','0.80')
into tmp_liujg_packed_based values('20190301','2019/03/01 17:15:01','002','0.89')
into tmp_liujg_packed_based values('20190301','2019/03/01 14:15:01','003','2.12')
into tmp_liujg_packed_based values('20190301','2019/03/01 18:15:01','003','1.12')
into tmp_liujg_packed_based values('20190301','2019/03/01 17:15:01','005','1.12')
into tmp_liujg_packed_based values('20190301','2019/03/01 19:15:01','005','0.12')
into tmp_liujg_packed_based values('20190301','2019/03/01 09:15:01','006','0.98')
into tmp_liujg_packed_based values('20190301','2019/03/01 11:15:01','006','1.45')
into tmp_liujg_packed_based values('20190302','2019/03/02 09:30:01','001','0.78')
into tmp_liujg_packed_based values('20190302','2019/03/02 19:30:01','001','0.88')
into tmp_liujg_packed_based values('20190302','2019/03/02 10:30:01','003','0.68')
into tmp_liujg_packed_based values('20190302','2019/03/02 09:30:01','005','1.01')
into tmp_liujg_packed_based values('20190302','2019/03/02 14:30:01','005','1.88')
into tmp_liujg_packed_based values('20190302','2019/03/02 14:30:01','006','1.88')
into tmp_liujg_packed_based values('20190302','2019/03/02 10:30:01','006','0.68')
into tmp_liujg_packed_based values('20190302','2019/03/02 15:30:01','007','0.68')
into tmp_liujg_packed_based values('20190302','2019/03/02 16:30:01','007','1.78')
into tmp_liujg_packed_based values('20190303','2019/03/03 16:30:01','005','0.68')
into tmp_liujg_packed_based values('20190303','2019/03/03 08:50:01','006','0.32')
into tmp_liujg_packed_based values('20190303','2019/03/03 10:50:01','006','1.78')
into tmp_liujg_packed_based values('20190303','2019/03/03 10:50:01','007','0.32')
into tmp_liujg_packed_based values('20190303','2019/03/03 15:50:01','008','1.01')
into tmp_liujg_packed_based values('20190303','2019/03/03 17:50:01','008','1.68')
into tmp_liujg_packed_based values('20190303','2019/03/03 10:30:01','010','1.88')
into tmp_liujg_packed_based values('20190303','2019/03/03 15:50:01','010','0.32')
into tmp_liujg_packed_based values('20190401','2019/04/01 09:50:00','008','0.18')
into tmp_liujg_packed_based values('20190401','2019/04/01 11:50:00','009','0.88')
into tmp_liujg_packed_based values('20190401','2019/04/01 12:35:00','009','0.32')
into tmp_liujg_packed_based values('20190401','2019/04/01 17:50:00','010','1.01')
into tmp_liujg_packed_based values('20190401','2019/04/01 12:35:00','011','1.68')
into tmp_liujg_packed_based values('20190401','2019/04/01 12:50:00','011','0.88')
into tmp_liujg_packed_based values('20190401','2019/04/01 15:50:00','012','0.32')
into tmp_liujg_packed_based values('20190401','2019/04/01 17:50:00','012','1.68')
into tmp_liujg_packed_based values('20190401','2019/04/01 12:35:00','012','1.88')
into tmp_liujg_packed_based values('20190402','2019/04/02 09:50:00','009','0.18')
into tmp_liujg_packed_based values('20190402','2019/04/02 19:50:00','009','1.18')
into tmp_liujg_packed_based values('20190402','2019/04/02 17:50:00','010','0.88')
into tmp_liujg_packed_based values('20190402','2019/04/02 16:50:00','010','0.32')
into tmp_liujg_packed_based values('20190402','2019/04/02 19:50:00','010','0.32')
into tmp_liujg_packed_based values('20190402','2019/04/02 16:50:00','013','0.88')
into tmp_liujg_packed_based values('20190402','2019/04/02 19:55:00','013','0.88')
into tmp_liujg_packed_based values('20190402','2019/04/02 20:55:00','013','1.01')
into tmp_liujg_packed_based values('20190402','2019/04/02 19:55:00','014','0.32')
into tmp_liujg_packed_based values('20190402','2019/04/02 20:55:00','014','1.01')
into tmp_liujg_packed_based values('20190501','2019/05/01 09:50:00','001','1.18')
into tmp_liujg_packed_based values('20190501','2019/05/01 09:55:00','002','0.32')
into tmp_liujg_packed_based values('20190501','2019/05/01 10:55:00','002','0.32')
into tmp_liujg_packed_based values('20190501','2019/05/01 10:00:00','007','0.88')
into tmp_liujg_packed_based values('20190501','2019/05/01 10:55:00','015','0.88')
into tmp_liujg_packed_based values('20190501','2019/05/01 10:00:00','015','0.32')
into tmp_liujg_packed_based values('20190501','2019/05/01 14:00:00','017','1.01')
into tmp_liujg_packed_based values('20190501','2019/05/01 15:00:00','017','1.01')
into tmp_liujg_packed_based values('20190501','2019/05/01 15:30:00','018','0.88')
into tmp_liujg_packed_based values('20190501','2019/05/01 16:30:00','018','0.68')
into tmp_liujg_packed_based values('20190601','2019/06/01 09:50:00','008','1.38')
into tmp_liujg_packed_based values('20190601','2019/06/01 14:50:00','017','0.88')
into tmp_liujg_packed_based values('20190601','2019/06/01 11:50:00','019','1.01')
into tmp_liujg_packed_based values('20190601','2019/06/01 13:50:00','019','0.88')
into tmp_liujg_packed_based values('20190601','2019/06/01 14:50:00','019','0.68')
into tmp_liujg_packed_based values('20190602','2019/06/02 09:50:00','021','0.38')
into tmp_liujg_packed_based values('20190602','2019/06/02 13:50:00','012','0.88')
into tmp_liujg_packed_based values('20190602','2019/06/02 15:50:00','012','1.12')
into tmp_liujg_packed_based values('20190602','2019/06/02 13:59:00','023','0.88')
into tmp_liujg_packed_based values('20190602','2019/06/02 15:50:00','023','1.01')
into tmp_liujg_packed_based values('20190701','2019/07/01 09:50:00','023','0.38')
into tmp_liujg_packed_based values('20190701','2019/07/01 13:50:00','023','0.78')
into tmp_liujg_packed_based values('20190701','2019/07/01 14:50:00','008','0.68')
into tmp_liujg_packed_based values('20190701','2019/07/01 14:50:00','024','0.68')
into tmp_liujg_packed_based values('20190701','2019/07/01 15:50:00','024','1.68')
into tmp_liujg_packed_based values('20190701','2019/07/01 14:50:00','026','0.68')
into tmp_liujg_packed_based values('20190701','2019/07/01 18:50:00','026','1.68')
into tmp_liujg_packed_based values('20190701','2019/07/01 14:50:00','027','0.68')
into tmp_liujg_packed_based values('20190701','2019/07/01 19:35:00','027','1.11')
into tmp_liujg_packed_based values('20190702','2019/07/02 09:50:00','026','0.78')
into tmp_liujg_packed_based values('20190702','2019/07/02 11:50:00','026','0.78')
into tmp_liujg_packed_based values('20190702','2019/07/02 13:50:00','028','1.01')
into tmp_liujg_packed_based values('20190702','2019/07/02 14:35:00','028','0.88')
into tmp_liujg_packed_based values('20190702','2019/07/02 15:35:00','028','0.33')
select 1 from dual
解答
-- 第一问
select
imp_date,-- 日期
count(qimei) as dau
from
tmp_liujg_dau_based
where to_date(imp_date,'yyyy-mm-dd') > to_date('20190601','yyyy-mm-dd')
group by
imp_date
-- 第二问
select
a.imp_date-- 日期
,case when b.is_new='1' then '1'
when b.is_new='0' then '0'
else '2' end as is_new -- 用户类型
,count(distinct a.qimei) user_num
,sum(a.add_money) / count(distinct a.qimei) as mean_money -- 人均领取金额
,count(1) / count(distinct a.qimei) as mean_get_count -- 人均领取次数
from(select
imp_date,
qimei,
add_money
from tmp_liujg_packed_based
where to_date(imp_date,'yyyy-mm-dd') > to_date('20190601','yyyy-mm-dd')
) a
left join(select
imp_date,
qimei,
is_new
from tmp_liujg_dau_based
where to_date(imp_date,'yyyy-mm-dd') > to_date('20190601','yyyy-mm-dd')) b on b.qimei=a.qimei and b.imp_date=a.imp_date
group by
a.imp_date
,case when b.is_new='1' then '1'
when b.is_new='0' then '0'
else '2' end
-- 第三问
select
month-- 月份
,get_money_days-- 领取天数
,count(qimei) as user_count-- 用户数
,sum(tol_money) / count(qimei) as mean_money-- 人均领取金额
,sum(tol_cnt) / count(qimei) as mean_get_count-- 人均领取次数
from(select
to_char(to_date(imp_date,'yyyy-mm-dd'),'yyyymm') as month
,qimei
,count(distinct imp_date) as get_money_days
,sum(add_money) as tol_money
,count(1) as tol_cnt
from (select
imp_date,
qimei,
report_time,
add_money
from tmp_liujg_packed_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd'))
group by
to_char(to_date(imp_date,'yyyy-mm-dd'),'yyyymm')
,qimei)
group by
month
,get_money_days
-- 第四问
select
a.imp_mon as month-- 月份
,case when b.qimei is not null then '1'
when b.qimei is null then '0'
else null end as is_packet_user -- 用户类型 1红包用户 0非红包用户
,count(a.qimei) as user_count
,sum(a.active_days) / count(a.qimei) as mean_days_in_month
from
(select
substr(imp_date,1,6) as imp_mon
,qimei
,count(imp_date) as active_days
from tmp_liujg_dau_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')
group by
substr(imp_date,1,6)
,qimei ) a
left join
(select
substr(imp_date,1,6) as imp_mon
,qimei
from tmp_liujg_packed_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')
group by
substr(imp_date,1,6)
,qimei) b on b.imp_mon=a.imp_mon and b.qimei=a.qimei
group by
a.imp_mon
,case when b.qimei is not null then '1'
when b.qimei is null then '0'
else null end
-- 第五问
select
a.month,
a.qimei,
case when b.imp_date is not null then b.imp_date else null end as register_date
from (select
to_char(to_date(a.imp_date,'yyyy-mm-dd'),'yyyymm') as month
,qimei
from tmp_liujg_dau_based
group by
to_char(to_date(a.imp_date,'yyyy-mm-dd'),'yyyymm')
,qimei
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')) a
left join(select
qimei
,imp_date
from tmp_liujg_dau_based
group by
qimei
,imp_date
where is_new='1'
and to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')) b on b.qimei=a.qimei
-- 第六问
select
a1.day
,count(a1.qimei) as dau
,count(case when a2.day-a1.day=1 then a1.qimei else null end) / count(a1.qimei) as retain_rate
,count(case when b.qimei is not null then
case when a2.day-a1.day=1 then a1.qimei else null end else null end)
/ decode(count(case when b.qimei is not null then a1.qimei else null end),0,1,count(case when b.qimei is not null then a1.qimei else null end)) as pick_user_rate
,count(case when b.qimei is null then
case when a2.day-a1.day=1 then a1.qimei else null end else null end)
/ decode(count(case when b.qimei is null then a1.qimei else null end),0,1,count(case when b.qimei is null then a1.qimei else null end)) as unpick_user_rate
from(select
to_date(imp_date,'yyyy-mm-dd') as day,
qimei
from tmp_liujg_dau_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')) a1
left join(select
to_date(imp_date,'yyyy-mm-dd') as day,
qimei
from tmp_liujg_dau_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')) a2 on a2.qimei=a1.qimei
left join(select
to_date(imp_date,'yyyy-mm-dd') as day,
qimei
from tmp_liujg_packed_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')
group by
imp_date
, qimei ) b on b.day=a1.day and b.qimei=a1.qimei
group by
a1.day
-- 第七问
select
imp_date
,qimei
,add_money as money
from(select
a.imp_date
,a.qimei
,a.report_time
,a.add_money
,row_number()over(partition by a.imp_date,a.qimei order by a.report_time asc) as report_rank
from (select
imp_date,
qimei,
add_money,
report_time
from tmp_liujg_packed_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')
)a
inner join(select
imp_date,
qimei
from tmp_liujg_dau_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')
and is_new=1
)b on b.imp_date=a.imp_date and b.qimei=a.qimei)
where report_rank = 1
-- 第八问
select
imp_date-- 注册日期
,qimei-- 用户唯一标识
,report_time as first_action_time -- 首次领红包时间
,n1_report_time as second_action_time -- 第二次领红包时间
,to_date(n1_report_time,'yyyy-mm-dd hh24:mi:ss') - to_date(report_time,'yyyy-mm-dd hh24:mi:ss') as delta_times -- 时间差
from(select
b.imp_date
,a.qimei
,a.report_time
,row_number()over(partition by a.imp_date,a.qimei order by a.report_time asc) as report_rank -- 领红包时间的排序
,lead(a.report_time,1)over(partition by a.imp_date,a.qimei order by a.report_time asc) as n1_report_time -- 第二次领红包的时间
from(select
imp_date,
qimei,
report_time
from tmp_liujg_packed_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd'))a
inner join(select
imp_date,
qimei
from tmp_liujg_dau_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190301','yyyy-mm-dd')
and is_new=1
)b on b.imp_date=a.imp_date and b.qimei=a.qimei)
where report_rank = 1
-- 第九问
9.计算2019年6月1日至今,每日领取红包用户领取金额的中位数
输出维度:imp_date(日期)
输出指标:mid_money(用户当日领取红包金额的中位数)
select
imp_date-- 日期
,median(to_number(add_money)) as mid_money -- 用户当日领取红包金额的中位数
from tmp_liujg_packed_based
where to_date(imp_date,'yyyy-mm-dd') >= to_date('20190601','yyyy-mm-dd')
group by imp_date