本系列包含:
窗口函数实战(二)
本篇文章是笔者在牛客网上摘选的几道比较有挑战性的SQL窗口函数编程题。
1.每月及截止当月的答题情况(原题链接)
我们首先来分析一下题目的要求:
- 用户有作答记录,所以
submit_time
即使为空也是满足要求的。 - 月活跃用户数。这个比较好实现,注意去重。
- 月新增用户数。这个要求是这一题的难点。我们可以生成一列,用于记录该用户是否为新增用户,其判断条件就是对于某用户的某条试卷作答记录,该用户是否有更早的作答时间。
- 截止当月的单月最大新增用户数、截止当月的累积用户数。这两个都可以根据月新增用户数得到,比较容易实现。
下面提供一种解法。
select start_month, count(distinct uid) as mau,
sum(new_customer) as month_add_uv,
max(sum(new_customer)) over (order by start_month) as max_month_add_uv,
sum(sum(new_customer)) over (order by start_month) as cum_sum_uv
from (
select uid, date_format(start_time,'%Y%m') as start_month,
if(start_time = min(start_time) over (partition by uid), 1, 0) as new_customer
from exam_record
) a
group by start_month
order by start_month
2.每份试卷每月作答数和截止当月的作答总数(原题链接)
这一题只给了一个表,问法也比较简单:每份试卷每月作答数和截止当月的作答总数。
- 有作答记录的月份,
start_time
不为空即可。 - 每月的试卷作答数。
- 截至当月的作答总数。
以下是我第一次解答时的做法。
with b as (
select exam_id, start_month,
count(start_month) over (partition by exam_id, start_month order by exam_id, start_month) as count_,
row_number() over (partition by exam_id, start_month) as row_number_
from (
select exam_id, date_format(start_time,'%Y%m') as start_month
from exam_record
) a
)
select exam_id, start_month, count_,
sum(count_) over (partition by exam_id order by exam_id, start_month)
from b
where row_number_ = 1
思考了一下,觉得还是比较繁(因为包含了一个删除重复行的操作),应该在一开始就使用 group by
,改写后如下。
select exam_id,
date_format(start_time,'%Y%m') as start_month,
count(start_time) as month_cnt,
sum(count(start_time)) over (partition by exam_id order by exam_id, date_format(start_time,'%Y%m'))
from exam_record
group by exam_id, date_format(start_time,'%Y%m')
3.对试卷得分做min-max归一化(原题链接)
这一道题的查询条件比较多,我们一起来分析一下。
- 试卷的难度为
difficult
。 - 将
score
标准化, x − m i n ( x i ) m a x ( x i ) − m i n ( x i ) \frac{x-min(x_i)}{max(x_i)-min(x_i)} max(xi)−min(xi)x−min(xi),所以score
不能为空。 - 标准化的条件是:在每份试卷作答记录内进行标准化,说明要对
exam_id
分组。 - 如果某个试卷作答记录中只有一个得分,那么无需使用公式,归一化并缩放后分数仍为原分数。
- 输出用户ID、试卷ID、归一化后分数平均值(某用户可能多次作答同一份试卷,计算的是这个平均值)。
下面给出一种解法。
select uid, exam_id, round(avg(standard_score)) as avg_new_score
from(
select *, ( case when max_score != min_score then (score - min_score)/(max_score - min_score) * 100
else score end ) as standard_score
from (
select uid, exam_id, score,
max(score) over (partition by exam_id) as max_score,
min(score) over (partition by exam_id) as min_score
from exam_record
left outer join examination_info using(exam_id)
where difficulty = 'hard' and score is not null
) a
) b
group by uid, exam_id
order by exam_id, avg_new_score desc
判断是否 “某个试卷作答记录中只有一个得分” ,除了可以用 case when
, 还可以像下面这么写。
if( max_score = min_score, score, (score - min_score)/(max_score - min_score) * 100 ) as standard_score
当我们练过一些牛客上的 SQL 题会发现,即使是困难的题目,只要充分理解题意,罗列清楚查询条件,通过窗口函数、多层嵌套、表连接、条件语句等知识组合,我们也能很顺利地取到想要的数。
除了准,还要快,所以当我们能确保取数的正确性后,就要进一步考虑如何能更快速地取数,也就是如何优化你的 SQL。道阻且长,与君共勉!