SQL数据分析极简入门
第三章 SQL面试题
前言
参与CDA_SQL数据分析极简入门,记录学习进度,便于复习
一、近N日登录与连续登录N日问题
题目
现有用户登录表(user_active_log)一份,里面有2个字段:userlD(用户ID), createdTime(登录时间戳),需要统计2021年12月连续登录7天的用户数量。
select userId,createdTime
from user_active_log limit 10;
userId | createdTime |
---|---|
34537 | 1577894672 |
34636 | 1577894683 |
38913 | 1577895261 |
… | … |
分析过程
构造一个连续数字构成的辅助列,用原始日期减去辅助列的数字,得到一个新日期,根据这个新日期来判断是否连续;
或者构造一个连续日期构成的辅助列,用原始日期减去这个辅助列的日期,得到一个新数字,最后根据这个数字来判断连续;
这里分别展示两种思路如下:
方法A:
日期 | 辅助列 | 新日期 |
---|---|---|
2021-12-02 | 1 | 2021-12-01 |
2021-12-03 | 2 | 2021-12-01 |
2021-12-04 | 3 | 2021-12-01 |
… | … | … |
方法B:
日期 | 辅助列 | 新数字 |
---|---|---|
2021-12-02 | 2021-12-01 | 1 |
2021-12-03 | 2021-12-02 | 1 |
2021-12-04 | 2021-12-03 | 1 |
… | … | … |
一般采用A方案。
step1:选择12月的记录,并根据用户ID和登录日期先去重(注:单个用户一天有多行登录数据的情况,只保留1行)
step2:创建辅助列a_rk(每个userID下的日期排序值)
step3:创建辅助列起步时间 b_createdTime(用登录日期减去排序值,得到新时间列)
step4:根据起步实际教案列统计连续登录天数
step5:根据统计结果查询连续登录人数(题目要求连续7天)
step1:
选择12月的记录,并根据用户ID和登录日期先去重(注:单个用户一天有多行登录数据的情况,只保留1行)
select userID,date(from_unixtime(createdTime)) a_createdTime
from user_active_log
where substr(date(from_unixtime(createdTime)),1,7)='2021-12'
#知识点1:时间戳转为时间字符串格式然后取前7个字符
group by userId,date(from_unixtime(createdTime))
#知识点2:根据userId,a_createdTime 去重
step2:
创建辅助列a_rk(每个userID下的日期排序值)
select userID,a_createdTime,row_number() over(partition by userId order by a_createdTime)a_rk
#知识点3:用row_number()对每个userID下的a_createdTime进行排名
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
where substr(from_unixtime(createdTime),1,7)='2021-12'
group by userId,a_createdTime)t0
step3:
创建辅助列起步时间 b_createdTime(用登录日期减去排序值,得到新时间列)
select *,date_sub(a_createdTime,interval a_rk day )b_createdTime
# 知识点4 date_sub 日期减去数字:datediff 日期减日期
from
(
select userID,a_createdTime,row_number() over(partition by userId order by a_createdTime)a_rk
#知识点3:用row_number()对每个userID下的a_createdTime进行排名
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
where substr(from_unixtime(createdTime),1,7)='2021-12'
group by userId,a_createdTime)t0) t1
step4:
根据起步实际教案列统计连续登录天数
select userId,b_createdTime,count(1) cts
from
(select *,date_sub(a_createdTime,interval a_rk day )b_createdTime
# 知识点4 date_sub 日期减去数字:datediff 日期减日期
from
(
select userID,a_createdTime,row_number() over(partition by userId order by a_createdTime)a_rk
#知识点3:用row_number()对每个userID下的a_createdTime进行排名
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
where substr(from_unixtime(createdTime),1,7)='2021-12'
group by userId,a_createdTime)t0) t1) t2 group by userId,b_createdTime having count(1)>6
#知识点5:having 用在groupby后条件筛选
step5:
根据统计结果查询连续登录人数(题目要求连续7天)
select count(distinct userId) num_users
from
(select userId,b_createdTime,count(1) cts
from
(select *,date_sub(a_createdTime,interval a_rk day )b_createdTime
# 知识点4 date_sub 日期减去数字:datediff 日期减日期
from
(
select userID,a_createdTime,row_number() over(partition by userId order by a_createdTime)a_rk
#知识点3:用row_number()对每个userID下的a_createdTime进行排名
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
where substr(from_unixtime(createdTime),1,7)='2021-12'
group by userId,a_createdTime)t0) t1) t2 group by userId,b_createdTime having count(1)>6
)t3;
运行结果如下: num_users 440
二、近N日留存的用户数及留存率
题目
现有用户登录表(user_active_log)一份,里面有2个字段:userlD(用户ID), createdTime(登录时间戳),需要统计近1、2、3、5、7、30日留存用户数量及留存率。
userId | createdTime |
---|---|
34537 | 1577894672 |
34636 | 1577894683 |
38913 | 1577895261 |
… | … |
构造一个起始日期构成的辅助列,用原始日期减去辅助列的日期,得到一个新数字N,根据这个新数字,结合起始日期来判断某个日期的近N日留存:
日期 | 辅助列 | 新数字 |
---|---|---|
2021-12-02 | 2021-12-01 | 1 |
2021-12-03 | 2021-12-01 | 2 |
2021-12-04 | 2021-12-01 | 3 |
… | … | … |
Step1:根据用户id和登录日期先去重.
Step2:创建新列first_time,获取每个userlD下的最早登录日期.
Step3:创建辅助列delta_time,用登录日期列减去最早登录日期first_time,得到留存天数
Step4:按first_time列统计不同留存天数对应的次数和 即 某日的近N日留存数
Step5:用某日的近N日留存数除以首日登录人数即留存率
Step1:
根据用户id和登录日期先去重
select
userID,
date(from_unixtime(createdTime))a_createdTime
from user_active_log
group by userId,a_createdTime;
Step2:
创建新列first_time,获取每个userlD下的最早登录日期.
select
userID,
a_createdTime,
first_value(a_createdTime)over(partition by userId order by a_createdTime)first_time
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
group by userId,a_createdTime
)t0;
Step3:
创建辅助列delta_time,用登录日期列减去最早登录日期first_time,得到留存天数
select
userID,
a_createdTime,
first_value(a_createdTime)over(partition by userId order by a_createdTime)first_time,
datediff(a_createdTime, first_value(a_createdTime) over(partition by userI
d order by a_createdTime ))delta_time
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
group by userId,a_createdTime
)t0;
Step4:
按first_time列统计不同留存天数对应的次数和 即 某日的近N日留存数
select
t1.first_time,
sum( case when t1.delta time =1 then 1 else 0 end) day_1,
sum( case when t1.delta time =2 then 1 else 0 end)day_2,
sum( case when t1.delta time =3 then 1 else 0 end) day_3,
sum( case when t1.delta time =5 then 1 else 0 end)day_5,
sum( case when t1.delta time =7 then 1 else 0 end) day_7,
sum(case when t1.delta time = 30 then 1 else 0 end )day_30
from
(
select
userID,
a_createdTime,
first_value(a_createdTime)over(partition by userId order by a_createdTime)first_time,
datediff(a_createdTime, first_value(a_createdTime) over(partition by userI
d order by a_createdTime ))delta_time
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
group by userId,a_createdTime
)t0
)t1
group by t1.first_time
order by t1.first_time;
Step5:
用某日的近N日留存数除以首日登录人数即留存率
select
t1.first_time,
sum( case when t1.delta time =1 then 1 else 0 end)/count(distinct ti.userID) day_1,
sum( case when t1.delta time =2 then 1 else 0 end)/count(distinct ti.userID) day_2,
sum( case when t1.delta time =3 then 1 else 0 end)/count(distinct ti.userID) day_3,
sum( case when t1.delta time =5 then 1 else 0 end)/count(distinct ti.userID) day_5,
sum( case when t1.delta time =7 then 1 else 0 end)/count(distinct ti.userID) day_7,
sum(case when t1.delta time = 30 then 1 else 0 end )/count(distinct ti.userID) day_30
from
(
select
userID,
a_createdTime,
first_value(a_createdTime)over(partition by userId order by a_createdTime)first_time,
datediff(a_createdTime, first_value(a_createdTime) over(partition by userI
d order by a_createdTime ))delta_time
from
(
select userID,date(from_unixtime(createdTime))a_createdTime
from user_active_log
group by userId,a_createdTime
)t0
)t1
group by t1.first_time
order by t1.first_time;
总结
待补充