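Original problem: http://practice.atguigu.cn/#/question/28/desc?qType=SQL
Problem requirement
Using the user login detail table (user_login_detail), compute for each day the number of new users and their 1-day retention rate. A user's first login counts as a new user for that day; if the user also logs in on the following day, that counts as 1-day retention.
The expected result is as follows: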
first_login (registration date) | register (new users) | retention<decimal(16,2)> (retention rate) |
---|---|---|
2021-09-21 | 1 | 0.00 |
2021-09-22 | 1 | 0.00 |
2021-09-23 | 1 | 0.00 |
2021-09-24 | 1 | 0.00 |
2021-09-25 | 1 | 0.00 |
2021-09-26 | 1 | 0.00 |
2021-09-27 | 1 | 0.00 |
2021-10-04 | 2 | 0.50 |
2021-10-06 | 1 | 0.00 |
Tables required:
User login detail table: user_login_detail
user_id (user ID) | ip_address (IP address) | login_ts (login time) | logout_ts (logout time) |
---|---|---|---|
101 | 180.149.130.161 | 2021-09-21 08:00:00 | 2021-09-27 08:30:00 |
102 | 120.245.11.2 | 2021-09-22 09:00:00 | 2021-09-27 09:30:00 |
103 | 27.184.97.3 | 2021-09-23 10:00:00 | 2021-09-27 10:30:00 |
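Approach
This problem is similar to problem 05. Because it asks for the number of new users per day and their 1-day retention rate, we can simply find each user's first login date and then subtract it from each of that user's login dates; a difference of 1 means the user was retained on day 1. Solutions 1 and 2 follow this approach:
1. Join-based computation (per-user Cartesian product)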
SELECT t1.first_login,
       COUNT(DISTINCT t1.user_id) AS register,
       cast(COUNT(DISTINCT t2.user_id)/COUNT(DISTINCT t1.user_id) AS decimal(16,2)) AS retention
FROM
(
    SELECT user_id,
           MIN(date(login_ts)) AS first_login
    FROM user_login_detail
    GROUP BY user_id
) t1
LEFT JOIN
(
    SELECT user_id,
           date(login_ts) AS login_date
    FROM user_login_detail
    GROUP BY user_id,
             date(login_ts)
) t2
ON t1.user_id = t2.user_id AND DATEDIFF(t2.login_date, t1.first_login) = 1
GROUP BY t1.first_login
2. Window function: attach the first login date to every record and take the difference
SELECT first_login,
       COUNT(DISTINCT IF(login_date = first_login,user_id,NULL)) AS register,
       cast(COUNT(DISTINCT IF(DATEDIFF(login_date,first_login) = 1,user_id,NULL)) /COUNT(DISTINCT user_id) AS decimal(16,2)) AS retention
FROM
(
    SELECT user_id,
           login_date,
           MIN(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS first_login -- first_value(login_date) also works
    FROM
    (
        SELECT user_id,
               date(login_ts) AS login_date
        FROM user_login_detail
        GROUP BY user_id,
                 date(login_ts)
    ) t1
) t2
GROUP BY first_login
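The idea shared by solutions 1 and 2 can be checked by hand on a few rows. The following is a minimal Python sketch, not part of the original solutions: the rows in `LOGINS` and the user IDs 201 and 202 are made up for illustration, and `one_day_retention` is a hypothetical helper name. It deduplicates login dates per user, takes the earliest date as the first login, and counts a user as retained when the day after the first login is also present.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical (user_id, login_date) rows; NOT the real table contents.
LOGINS = [
    (101, date(2021, 9, 21)),
    (101, date(2021, 9, 27)),
    (201, date(2021, 10, 4)),
    (201, date(2021, 10, 5)),
    (201, date(2021, 10, 5)),  # second login on the same day, deduplicated below
    (202, date(2021, 10, 4)),
    (202, date(2021, 10, 6)),
]


def one_day_retention(logins):
    """Return {first_login: (new_users, retention_rate)}."""
    # Equivalent of GROUP BY user_id, date(login_ts): distinct days per user.
    days_by_user = defaultdict(set)
    for user_id, day in logins:
        days_by_user[user_id].add(day)

    # Per first-login date: [new users, users who came back the next day].
    stats = defaultdict(lambda: [0, 0])
    for days in days_by_user.values():
        first_login = min(days)
        stats[first_login][0] += 1
        if first_login + timedelta(days=1) in days:  # DATEDIFF(...) = 1
            stats[first_login][1] += 1

    return {
        day: (new, round(kept / new, 2))
        for day, (new, kept) in sorted(stats.items())
    }


RESULT = one_day_retention(LOGINS)
```

With these made-up rows, 2021-10-04 has two new users of whom one (201) returns the next day, which mirrors the 0.50 row in the expected result.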
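Beyond the conventional approach, this problem is essentially a consecutive-interval/retention problem, so the two consecutive-interval techniques in solutions 3 and 4 also apply.
3. lead()/lag() window functions: fetch the neighboring record and take the difference
Take each user's first record and use lead() to look ahead to the second, or take each user's second record and use lag() to look back to the first. If the first and second logins fall on two consecutive days, the date difference is 1.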
SELECT login_date AS first_login,
       COUNT(DISTINCT user_id) AS register,
       cast(COUNT(DISTINCT IF(DATEDIFF(next_date,login_date) = 1,user_id,NULL)) /COUNT(DISTINCT user_id) AS decimal(16,2)) AS retention
FROM
(
    SELECT user_id,
           login_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date ASC) AS rn,
           lead(login_date,1,'9999-12-31') OVER (PARTITION BY user_id ORDER BY login_date ASC) AS next_date
    FROM
    (
        SELECT user_id,
               date(login_ts) AS login_date
        FROM user_login_detail
        GROUP BY user_id,
                 date(login_ts)
    ) t1
) t2
WHERE rn = 1
GROUP BY login_date
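To make the lead() step concrete, here is a small Python sketch of what the inner query produces for the `rn = 1` row of each user. It is an illustration under assumptions, not the original author's code: the rows and user IDs are invented, and `first_and_next_login` is a hypothetical helper. Users with a single login day get the same far-future default that the SQL passes to lead().

```python
from collections import defaultdict
from datetime import date

# Mirrors the SQL default '9999-12-31' used when there is no later login.
NO_NEXT_LOGIN = date(9999, 12, 31)

# Hypothetical (user_id, login_date) rows; NOT the real table contents.
LOGINS = [
    (101, date(2021, 9, 21)),
    (101, date(2021, 9, 27)),
    (201, date(2021, 10, 4)),
    (201, date(2021, 10, 5)),
    (203, date(2021, 10, 6)),  # only one login day
]


def first_and_next_login(logins):
    """Return {user_id: (first_login, next_login)}, like the rn = 1 rows."""
    days_by_user = defaultdict(set)
    for user_id, day in logins:
        days_by_user[user_id].add(day)

    pairs = {}
    for user_id, days in days_by_user.items():
        ordered = sorted(days)
        next_day = ordered[1] if len(ordered) > 1 else NO_NEXT_LOGIN
        pairs[user_id] = (ordered[0], next_day)
    return pairs


PAIRS = first_and_next_login(LOGINS)
# A user is retained when the gap between the two dates is exactly one day.
RETAINED = {
    user_id
    for user_id, (first, nxt) in PAIRS.items()
    if (nxt - first).days == 1
}
```

The sentinel keeps single-login users in the result with a very large gap, so they are counted as new users but never as retained.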
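4. row_number() window function: tag records to extract consecutive intervals
row_number() assigns a sequence number to each of a user's login dates. Subtracting that number, used as an offset, from the login date yields a base date, flag. Consecutive dates share the same base date, so the number of records in each flag group tells us whether the user has a run of consecutive logins.
In this problem the registration date also matters, so the base date must additionally be compared with the registration date.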
SELECT t1.first_login,
       COUNT(DISTINCT t1.user_id) AS register,
       cast(COUNT(DISTINCT t4.user_id)/COUNT(DISTINCT t1.user_id) AS decimal(16,2)) AS retention
FROM
(
    SELECT user_id,
           MIN(date(login_ts)) AS first_login
    FROM user_login_detail
    GROUP BY user_id
) t1
LEFT JOIN
(
    SELECT user_id,
           DATE_SUB(login_date,rn - 1) AS flag
    FROM
    (
        SELECT user_id,
               login_date,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date ASC) AS rn
        FROM
        (
            SELECT user_id,
                   date(login_ts) AS login_date
            FROM user_login_detail
            GROUP BY user_id,
                     date(login_ts)
        ) t2
    ) t3
    GROUP BY user_id,
             DATE_SUB(login_date,rn - 1)
    HAVING COUNT(1) >= 2
) t4
ON t1.user_id = t4.user_id AND t1.first_login = t4.flag
GROUP BY t1.first_login
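The offset trick is easier to follow on concrete dates. Below is a minimal Python sketch, assuming two invented users whose login days are given directly as lists; `streak_flags` and `retained_from_first_login` are hypothetical helper names. It also shows why the flag has to be compared with the first login date: a later streak produces a group of size 2 or more without implying 1-day retention.

```python
from collections import Counter
from datetime import date, timedelta


def streak_flags(days):
    """Count login days per base date (login_date minus rn - 1)."""
    ordered = sorted(set(days))
    # enumerate() starts at 0, which plays the role of rn - 1 in the SQL.
    return Counter(day - timedelta(days=offset) for offset, day in enumerate(ordered))


def retained_from_first_login(days):
    """True if the streak that starts on the first login day spans >= 2 days."""
    return streak_flags(days)[min(days)] >= 2


# Hypothetical user A: logs in on Oct 4, 5, 6, then again on Oct 9, 10.
USER_A = [date(2021, 10, d) for d in (4, 5, 6, 9, 10)]
# Hypothetical user B: logs in on Oct 4, skips Oct 5, then Oct 6 and 7.
USER_B = [date(2021, 10, d) for d in (4, 6, 7)]

FLAGS_A = streak_flags(USER_A)
FLAGS_B = streak_flags(USER_B)
```

User A ends up with two groups (base dates 2021-10-04 and 2021-10-06), and the first matches the first login, so A is retained. User B has a group of size 2 with base date 2021-10-05, which does not equal the first login date, so B is not retained even though B does have a two-day streak.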
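5. row_number() window function: take the first two login dates and subtract
Solution 5 simplifies solution 4: keep only each user's first n login records and subtract the first from the n-th. If the difference is n-1 and the user actually has n distinct login dates, the user logged in on n consecutive days starting from the registration date. For n = 2, as in this problem, a difference of 1 already implies two distinct dates.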
SELECT first_date AS first_login,
       COUNT(DISTINCT user_id) AS register,
       cast(COUNT(DISTINCT IF(DATEDIFF(second_date,first_date) = 1,user_id,NULL))/COUNT(DISTINCT user_id) AS decimal(16,2)) AS retention
FROM
(
    SELECT user_id,
           MIN(login_date) AS first_date,
           MAX(login_date) AS second_date
    FROM
    (
        SELECT user_id,
               login_date,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date ASC) AS rn
        FROM
        (
            SELECT user_id,
                   date(login_ts) AS login_date
            FROM user_login_detail
            GROUP BY user_id,
                     date(login_ts)
        ) t1
    ) t2
    WHERE rn <= 2
    GROUP BY user_id
) t3
GROUP BY first_date
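The "first n records" rule can be sketched in a few lines of Python. This is an illustrative sketch with invented dates and a hypothetical helper, `logged_in_first_n_days`, not the original author's code. It includes an explicit check that n distinct dates exist, because for n greater than 2 a user with fewer than n login days could otherwise show a difference of n-1 by coincidence.

```python
from datetime import date


def logged_in_first_n_days(days, n):
    """True if the user logged in on each of the first n days after signing up."""
    first_n = sorted(set(days))[:n]  # equivalent of WHERE rn <= n
    has_n_days = len(first_n) == n
    return has_n_days and (first_n[-1] - first_n[0]).days == n - 1


# Hypothetical login days in October 2021.
CONSECUTIVE_2 = [date(2021, 10, 4), date(2021, 10, 5)]
GAP = [date(2021, 10, 4), date(2021, 10, 6)]
CONSECUTIVE_3 = [date(2021, 10, 4), date(2021, 10, 5), date(2021, 10, 6)]
SINGLE = [date(2021, 10, 4)]
```

For example, `logged_in_first_n_days(GAP, 3)` is False: the two dates are 2 days apart, which equals n-1 for n = 3, but only two login days exist.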