本系列包含:
窗口函数实战(一)
本篇文章是笔者在牛客网上摘选的几道比较有挑战性的SQL窗口函数编程题。
1. 试卷完成数同比2020年的增长率及排名变化【困难】(原题链接)
我们首先来分析一下题目的要求:
- 时间上,是
2020年上半年
和2021年上半年
。 - 是已完成的试卷(submit_time 不为空),不考虑未完成的试卷。
排名变化
中的排名,是该类型试卷的当年完成情况的排名。
下面提供一种比较优雅的解法,视觉上也很容易理解。
- 将2020年和2021年的数据先分成两张表,在通过 inner join 连接,得到的就是能进行同比的试卷类别(tag)。这样相同类别的试卷的结果都在一行,方便计算增长率、排名变化。
- 有时候适当使用 with as 语句能让代码的逻辑更清晰。
with a as (
select tag, count(submit_time) as exam_cnt,
rank() over (order by count(submit_time) desc) as exam_cnt_rank_20
from exam_record
left outer join examination_info using(exam_id)
where date_format(submit_time,'%Y%m') between '202001' and '202006'
group by tag
),
b as (
select tag, count(submit_time) as exam_cnt,
rank() over (order by count(submit_time) desc) as exam_cnt_rank_21
from exam_record
left outer join examination_info using(exam_id)
where date_format(submit_time,'%Y%m') between '202101' and '202106'
group by tag
)
select a.tag, a.exam_cnt, b.exam_cnt,
concat(round((b.exam_cnt-a.exam_cnt) / a.exam_cnt * 100, 1),'%') as growth_rate,
exam_cnt_rank_20, exam_cnt_rank_21,
cast(exam_cnt_rank_21 as signed) - cast(exam_cnt_rank_20 as signed) as rank_delta
from a
inner join b using(tag)
order by growth_rate desc, exam_cnt_rank_21 desc
这一道题的有几个需要注意的点:
- 如何构建查询表,能比较方便地去计算同期(2020上半年、2021上半年)的增长率、排名变化。
- 窗口函数得到的排名,是 unsigned 类型,需要转换成 signed 类型,再进行计算。
此处再提供另外一种解法。
SELECT tag, exam_cnt_20, exam_cnt_21,
CONCAT(ROUND((exam_cnt_21 - exam_cnt_20) / exam_cnt_20 * 100, 1), '%') as growth_rate,
exam_cnt_rank_20, exam_cnt_rank_21,
CAST(exam_cnt_rank_21 as SIGNED) - CAST(exam_cnt_rank_20 AS SIGNED) rank_delta
FROM (
SELECT tag, exam_cnt_20, exam_cnt_21,
RANK() OVER (ORDER BY exam_cnt_20 DESC) exam_cnt_rank_20,
RANK() OVER (ORDER BY exam_cnt_21 DESC) exam_cnt_rank_21
FROM (
SELECT tag,
SUM(IF(DATE_FORMAT(submit_time, '%Y-%m') BETWEEN '2020-01' AND '2020-06', 1, 0)) exam_cnt_20,
SUM(IF(DATE_FORMAT(submit_time, '%Y-%m') BETWEEN '2021-01' AND '2021-06', 1, 0)) exam_cnt_21
FROM exam_record
LEFT JOIN examination_info USING(exam_id)
GROUP BY tag
) a
) b
WHERE exam_cnt_20 != 0 AND exam_cnt_21 != 0
ORDER BY growth_rate DESC, exam_cnt_rank_21 DESC
第二种解法给我们提供的思路是:
- 有时候计数不一定要用
count
,也可以用sum
和if
的组合。 - 这里的几个
select
嵌套子查询,在思路上也还是很清晰的。
其实这一题最核心的点,就是在一开始的时候要把 2020 年和 2021 年分成两列对待。我一开始参考了题目下面的 “解释” 那张表,反而造成后续的计算很麻烦。
2.未完成率较高的50%用户近三个月答卷情况【困难】(原题链接)
我们首先来分析一下题目要求:
- SQL 试卷未完成率较高的前 50%用户。注意:不是未完成率超过 50% 的用户。
- 用户等级为 6 或 7 级。
- 有作答记录(start_time)的近 3 个月。
这一题的查询限制条件本身就非常多,而又需要用到至少两个窗口函数,所以层层嵌套几乎是难免的,只能尽量通过调整语句格式让代码更加易懂一些。在这里,给出一种解法。
with a as (
select exam_record.uid, level,
count(start_time) - count(submit_time) as incomplete_cnt,
count(start_time) as total_cnt,
(count(start_time) - count(submit_time)) / count(start_time) as incomplete_rate
from exam_record
left outer join examination_info on exam_record.exam_id = examination_info.exam_id
left outer join user_info on exam_record.uid = user_info.uid
where tag = 'SQL'
group by exam_record.uid
order by exam_record.uid
),
b as (
select *, percent_rank() over (order by incomplete_rate) as percent_rank_
from a
),
c as (
select *
from b
where percent_rank_ >= 0.5 and level in (6,7)
)
select uid, start_month, count(start_month) as total_cnt, count(submit_time) as complete_cnt
from (
select uid, start_month, submit_time,
dense_rank() over (partition by uid order by start_month desc) as dense_rank_
from (
select uid, date_format(start_time,'%Y%m') as start_month, submit_time
from exam_record
where uid in ( select uid from c )
) d
) e
where dense_rank_ <= 3
group by uid, start_month
order by uid, start_month
- with 后面的 a、b、c 三张表,都是为了找到 SQL 试卷未完成率较高的前 50%,且等级为 6 或 7 级的用户 uid
- 然后再对时间排序(dense_rank),找到近 3 个月的答题记录,进行统计即可。